
STATIONARY AND NON-STATIONARY TIME SERIES PREDICTION USING

STATE SPACE MODEL AND PATTERN-BASED APPROACH

by

KIN MING KAM

Presented to the Faculty of the Graduate School of

The University of Texas at Arlington in Partial Fulfillment

of the Requirements

for the Degree of

DOCTOR OF PHILOSOPHY

THE UNIVERSITY OF TEXAS AT ARLINGTON

December 2014
Copyright © by KIN MING KAM 2014

All Rights Reserved


To my mother, Kwai Fan Yip, my father, Yat On Kam, and my wife, Ka Yee Mak.
ACKNOWLEDGEMENTS

I would like to thank my advisors, Dr. Li Zeng and Dr. Shouyi Wang, for their tremendous efforts in training me, for constantly motivating, encouraging and challenging me, and for all of their invaluable guidance and support during the course of my PhD study.

I wish to thank Dr. Victoria Chen for her consistent support since the very beginning of my PhD study, for her interest in my research, and for taking the time to serve on my dissertation committee.

I am also grateful to all the teachers who have taught me during my years at UT Arlington, especially Dr. Jay Rosenberger and Dr. Corley. I would also like to thank my classmates, with whom I studied, did projects, and shared so many precious moments. And I would like to thank the administrative staff members in the office, who did excellent work during the course of my study.

Last but not least, I would like to thank my parents and my wife for their love and patience, which allowed me to focus on my study.

November 20, 2014

ABSTRACT

STATIONARY AND NON-STATIONARY TIME SERIES PREDICTION USING

STATE SPACE MODEL AND PATTERN-BASED APPROACH

KIN MING KAM, Ph.D.

The University of Texas at Arlington, 2014

Supervising Professors: Li Zeng, Shouyi Wang

Motion-adaptive radiotherapy techniques are promising because they can deliver ablative radiation doses to a tumor with minimal normal-tissue exposure by accounting for real-time tumor movement. However, a major challenge in the successful application of these techniques is the real-time prediction of target motion to accommodate system delivery latencies. Predicting respiratory motion in real time is challenging, and current respiratory motion prediction approaches are still not satisfactory in terms of accuracy and interpretability. Therefore, we propose a novel respiratory motion prediction approach based on the future values of best-matching patterns. This approach has three major ingredients: (1) construct a real-time accumulated pattern library by orthogonal polynomial approximation using a sliding window; (2) find the k nearest-neighbor patterns in the pattern library and apply a two-step approach to screen out disturbing patterns and identify the final predictive patterns; and (3) make the final prediction using the bootstrapped mean of the future values of the selected predictive patterns for a given prediction horizon. In a study of the respiratory motion traces of 27 lung cancer patients, the proposed prediction approach consistently achieved significantly higher accuracy than current respiratory motion prediction approaches, particularly for long prediction horizons.

There has been much interest in the beneficial effects of musical training on cognition. Previous studies have indicated that musical training is related to better working memory and that these behavioral differences are associated with differences in neural activity in the brain. However, it was not clear whether musical training impacts memory in general, beyond working memory. A comprehensive EEG pattern study has been performed, including various univariate and multivariate features, time-frequency (wavelet) analysis, power-spectra analysis, and deterministic chaos theory. Advanced feature selection approaches have also been employed to select the most discriminative EEG and brain activation features between musicians and non-musicians. High classification accuracy (more than 95%) in memory judgments was achieved using the Proximal Support Vector Machine (PSVM). For working memory, significant differences between musicians and non-musicians were found during the delay period. For long-term memory, significant differences in EEG patterns between groups were found in both the pre-stimulus and post-stimulus periods on recognition. These results indicate that musicians' memory advantage occurs in both working memory and long-term memory, and that the developed computational framework using advanced data mining techniques can be successfully applied to classify complex human cognition with high time resolution.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF ILLUSTRATIONS
LIST OF TABLES

Chapter

1. Introduction
   1.1 Motivation
   1.2 Research Objectives and Challenges
      1.2.1 Demand Forecasting in Service Industries
      1.2.2 Problem and Challenges
      1.2.3 Pattern-Based Online Prediction of Semi-periodic and Nonstationary Time Series
   1.3 Outline of the Dissertation
2. ARIMA and Dynamic Linear Model for Time Series Forecasting
   2.1 Literature Review
   2.2 ARIMA and Dynamic Linear Model
      2.2.1 ARIMA
      2.2.2 Dynamic Linear Models
   2.3 The Proposed Method
   2.4 Numerical Study
      2.4.1 Scenario Design and Computation Procedure
      2.4.2 Concerns To Address
      2.4.3 Results
      2.4.4 Summary of Numerical Study
   2.5 Case Study
   2.6 Summary And Future Work
3. Pattern-Based Real-Time Prediction Of Semi-Periodic And Nonstationary Time Series
   3.1 Introduction
   3.2 Related Work
      3.2.1 Time-Varying Seasonal Autoregression (TVSAR)
      3.2.2 Wavelet-Based Multiscale Autoregression
   3.3 Pattern-Based Variant Best-Neighbors Prediction Using Raw Data
      3.3.1 Personalized Pattern Monitoring Window Size
      3.3.2 Variant Best-Neighbors-Based Predictive Pattern Selection
      3.3.3 Online Prediction Frameworks Using the Selected Predictive Patterns
      3.3.4 Comparison of the Prediction Performance of RPKM and Some State-Of-The-Art Methods
   3.4 Pattern-Based Variant-Best-Neighbors Prediction Using Orthogonal Polynomials Approximated Respiratory Motion Time Series
      3.4.1 Orthogonal Polynomials Approximation
      3.4.2 Prediction Results of RPKM and OPPRED
      3.4.3 Weighted Orthogonal Polynomials Approximations
      3.4.4 Weighted Time Series Pattern Matching
   3.5 Discussion and Conclusion
      3.5.1 Future Studies
4. Pattern Recognition and Classification of Multivariate Time Series Signals: EEG Study of Musicians and Non-Musicians
   4.1 Introduction
   4.2 Methodology
      4.2.1 Data Acquisition and Experimental Settings
      4.2.2 Artifacts Removal
      4.2.3 Signal Feature Extraction
      4.2.4 Feature Vector Classification Using Proximal Support Vector Machine (PSVM)
   4.3 Result Discussion
   4.4 Summary and Future Work
5. Conclusions and Future Research

REFERENCES

BIOGRAPHICAL STATEMENT
LIST OF ILLUSTRATIONS

1.1 Examples of complex seasonality showing (a) non-integer seasonal periods, (b) multiple nested seasonal periods, and (c) multiple non-nested and non-integer seasonal periods
1.2 An example of low-count time series which are sample inventory demands
1.3 A computer-simulated lung
1.4 Some examples of respiratory motion time series
1.5 Outline of the dissertation
2.1 Categorization of quantitative forecasting models
2.2 Examples of stationary and homogeneous nonstationary time series
2.3 Illustration of 1-step forecasting
2.4 Illustration of k-step forecasting
2.5 Structure of the dynamic linear model
2.6 The proposed forecasting method with R estimates
2.7 An illustration of the definition of Interval and a histogram of the interval
2.8 Histogram of MSE of V
2.9 Histogram of r/R of V
2.10 Histogram of MSE of nf
2.11 Histogram of r/R of nf
2.12 Histogram of MSE of R
2.13 Histogram of r/R of R
2.14 Design of experiment and the prediction performance (mean of MSE and r/R of each set of experiments)
2.15 Performance of the proposed procedure for forecasting
2.16 Performance of forecasting with Updating-R vs. Fixed-R
2.17 The data used in the case study
2.18 Forecasting results of the six datasets
2.19 ACF and PACF plots of original data in case 1
2.20 ACF and PACF plots of deseasonalized data in case 1
2.21 ACF and PACF plots of model (0, 0, 0)(1, 1, 1)7 for case 1
2.22 ACF and PACF plots of model (1, 0, 1)(1, 1, 1)7 for case 1
2.23 MSE of the two methods, DLM (dark blue) vs. ARIMA, for the 6 cases, from left to right and then top to bottom
2.24 R estimates of the 6 cases, from left to right and then top to bottom. Except for case 5, the R estimate approaches a stable value as more observations arrive
2.25 Updating-R vs. Fixed-R (R = [0.001 0.01 0.02 0.05 0.08])
3.1 Estimation procedure for reference intervals
3.2 Wavelet decomposition of a respiratory motion time series
3.3 An example of an order-3 AR model built by 2-level wavelet scales
3.4 The general approach of the proposed pattern-based Variant-Best-Neighbors prediction using raw data
3.5 Three best neighbors (solid black lines) of the current segment (solid blue line); the dotted lines are their "future" values
3.6 Scatter plots (left) and autocorrelation function (right) of the height and the interval of respiratory motion versus its 1st lag
3.7 An illustration of the definition of Interval and a histogram of the interval
3.8 The prediction accuracy for various window lengths with window ratio to median interval (R) ranging from 0.3 to 1.5 for prediction horizons h = 1, 5, 10, 15, 20, 25, 30 for patient 16
3.9 A 3-D plot of the prediction accuracy for various window lengths with window ratio to median interval (R) ranging from 0.3 to 1.5 for prediction horizons h = 1, 5, 10, 15, 20, 25, 30 for patient 16
3.10 A flow chart of the VBN procedure: Phase I
3.11 Online prediction of a patient's respiratory data using unaligned BNs (left) and right-aligned BNs (right). Below are the best neighbors marked with vertical lines in the time series
3.12 A zoom-in view of Figure 3.11, using unaligned BNs (left) and right-aligned BNs (right). Right-aligned BNs are clearly better than unaligned BNs
3.13 Scatter plots of the error before tλ vs. the error after tλ. Correlation between the errors is observed
3.14 An illustration of the error of the best neighbors before and after time tλk. If the error on the left-hand side is large, then the error on the right-hand side is also likely to be large
3.15 A real example of the error of the best neighbors before and after time tλk
3.16 An example of an outlier in the best neighbors
3.17 Another example of best neighbors without any outliers
3.18 Kolmogorov-Smirnov test during prediction of the respiratory motion of patient 2 with prediction horizon h = 15
3.19 Illustration of support vector regression with insensitivity parameter ε and slack variable ξ
3.20 Prediction performance of RPKM, RPKS and the state-of-the-art methods for prediction horizons h = 20 to h = 30
3.21 Prediction performance of RPKM and RPKM (without adaptive ratio) for prediction horizons h = 1 to h = 30
3.22 The general approach of the proposed pattern-based Variant-Best-Neighbors prediction using orthogonal polynomials approximated respiratory motion time series
3.23 Legendre polynomials
3.24 An example of OP approximation in which the lower-order approximation (order 18) is better than the higher-order one (order 20)
3.25 Prediction performance of RPKM, OPPRED, TVSAR, wLMS, SVRpred and SARIMA
3.26 Prediction results of Patient 9 with h = 15
3.27 Prediction results of Patient 2 with h = 15
3.28 An example showing that even when two time series have the same amount of error, the occurrences of the errors can be very different. The upper plot shows two patterns that match very well in the older data (left) but not in the newest data; for prediction, the lower one is preferable
3.29 The weights of the shorter window (black dotted) and the longer window (red dotted)
3.30 Prediction performance of RPKM and OPPRED on noise-added data
3.31 An example of a noise-added time series in the simulation study. Simulated noise is added to the respiratory time series data of a patient
4.1 Schematic of the experimental paradigm. (A1 to A5) During the study period, participants were asked to judge whether the second stimulus matched the first. (B1 to B3) During the test period, participants made memory judgments to stimuli while rating their confidence. Low represents remember with low confidence, High represents remember with high confidence, and New represents a judgment where participants thought the stimulus was not studied
4.2 The map of the channel locations
4.3 Artifact removal using ICA
4.4 Topographies for ICA-based artifact removal
4.5 Groups of channels for inter- and intra-hemispheric power band asymmetry. For inter-hemispheric power band asymmetry, the value is calculated by pairs of the same colors over the other hemisphere. For intra-hemispheric power band asymmetry, the value is calculated by pairs of different colors within the same hemisphere
4.6 Comparison of the EEG signals of 30 channels of musicians (blue line) and non-musicians (red line) at epoch B1 and condition 30
4.7 Head plot for musicians and non-musicians at epoch B1 at 100 sec with ICA-based artifact removal
LIST OF TABLES

3.1 A list of latencies of different systems
3.2 The prediction performance metrics (mean and standard deviation of R-squares) of the proposed methods and the state-of-the-art respiratory motion prediction methods on 27 patients
3.3 The prediction performance metrics (mean and standard deviation of R-squares) of the proposed approaches with and without adaptive ratio on 27 patients
3.4 The coefficients of orthogonal polynomials up to order 20
3.5 The prediction performance of RPKM and OPPRED on 27 patients
3.6 The prediction performance metrics (mean and standard deviation of R-squares) of the proposed approaches on the noise-added respiratory motion time series of the first 4 patients
4.1 Frequency ranges and the corresponding brain signal frequency bands of the five levels of signals by discrete wavelet decomposition
4.2 A list of all comparison conditions of the experiments. For comparison conditions 4 to 11, the naming structure is stimulus/ground truth/response. For conditions 12 to 23, the naming structure is stimulus/response to that stimulus in the test session/whether it was the 1st or 2nd stimulus. For conditions 35 to 46, it is stimulus/confidence level of having seen the stimulus/correctness
4.3 Classification accuracy for 46 conditions and 8 epochs with 5-fold cross validation and 10 features selected by mRMR, without any ICA artifact removal
4.4 Classification accuracy for 46 conditions and 8 epochs with 5-fold cross validation and 10 features selected by mRMR, with ICA artifact removal
4.5 Classification sensitivity and specificity for 46 conditions and 8 epochs with 5-fold cross validation and 10 features selected by mRMR, with ICA artifact removal
4.6 Classification sensitivity and specificity for 46 conditions and 8 epochs with 5-fold cross validation and 10 features selected by mRMR, with ICA artifact removal
CHAPTER 1

Introduction

1.1 Motivation

In 1960, Muth pioneered simple exponential smoothing (SES) and developed a useful classification of trend and seasonal patterns depending on whether they are additive or multiplicative [1]. Following the work of Box and Jenkins, some linear exponential smoothing forecasts were shown to be special cases of the ARIMA model. In 1985, Snyder proposed a class of innovation state space models and paved the way for the development of forecasting models for nonlinear exponential smoothing methods that can be derived statistically. Over the years, many time series problems have been identified and various methods have been developed to overcome the challenges in time series analysis.

Time series analysis has attracted much attention in the past three decades. According to the Google Scholar search engine, up to 1990 there were only 67,000 publications containing the keywords "time series", while by 2000 the number had risen to 453,000. Currently, 3,110,000 publications can be found, in contrast to the 8,680,000 publications containing the keyword "data". Although this is only a rough text-mining exercise, it still reveals a tremendous volume of research in the context of time series. According to Google Scholar, there are 1,660,000 results on the keywords "time series" and "prediction", constituting over half of the total research volume in time series analysis. Time series prediction is thus a very popular and challenging problem.

Time series analysis comprises methods for analyzing time series data in order to extract meaningful information from the data, and it includes several major areas of study: indexing, clustering, classification, prediction, summarization, anomaly detection and segmentation [2]. These resemble the major data mining areas, which are very popular in scientific research and in industry. These methods are very often used together. For instance, Rubio proposed a weighted least squares support vector machine for time series prediction which combines the use of prediction and classification [3]. In our study of respiratory motion time series prediction, we apply pattern matching, regression and anomaly detection.

Time series prediction is the task of using a model to predict future values based on historical data. In service industries, the ability to accurately estimate demand is very important for better marketing and cost saving. Preez [4] investigated tourism demand from four European countries to the Seychelles using time series prediction. In healthcare, time series prediction is applied to radiotherapy to give patients a better quality of life [5, 6], and hospital management applies it to predict the demand of nurse triage centers [7]. In the transportation industry, companies forecast the demand for cargo and passenger transportation.

Time series prediction can be classified into two categories: stationary time series prediction and non-stationary time series prediction. ARIMA and many stochastic models, such as dynamic linear models, perform well on stationary data. ARIMA is one of the most popular methods in time series prediction because of its generality [8]: autoregressive, moving average and exponential smoothing models are special cases of the ARIMA framework [1]. However, ARIMA has its own limitations, and these methods generally do not perform well on non-stationary data. Therefore, new methods are developed to cope with various situations.

The advancement of information technologies in healthcare and other service industries provides tremendous amounts of data and, in turn, many research opportunities. Large amounts of data are available in service industries such as hospitals, transportation companies and travel agencies.

This dissertation focuses on demand forecasting for service industries and respiratory motion time series prediction. These two problems cover both stationary and semi-periodic time series, so the solutions provided have great potential to be applied to a broad range of problems in practice. Specifically, we consider respiratory motion time series prediction, which predicts the tumor position during radiotherapy, as well as the number of calls received at a nurse triage center and the load history of cargo in railroad service. The research seeks to explore fundamental methodologies that improve the currently available methods and conquer the difficulties in the prediction of time-varying time series data.

1.2 Research Objectives and Challenges

1.2.1 Demand Forecasting in Service Industries

Background. Preez [4] points out that accurate forecasts of tourism demand are essential for efficient planning by the various sectors of the tourism industry, and forecast accuracy is particularly important in the tourism context because the tourism product is perishable; e.g., unused plane seats, hotel rooms and hire cars cannot be stockpiled. Specifically, short-term forecasts can aid decision making in areas such as scheduling, staffing and planning tour operator brochures.

According to Peck [9], emergency department (ED) crowding in hospitals is a major problem nationally and occurs when there is a mismatch between the demand and supply of the resources needed to evaluate, treat, and discharge patients from the ED. In current practice, bed requests and preparation to receive the patient are often delayed until admission is certain. Since the ED is usually very crowded, leaving facilities unutilized is undesirable. Therefore, Peck investigated forecasting methods to predict the demand.

Babcock studied a forecasting problem on cargo grain for the railroad industry [10]. Grain shippers need the forecasts to evaluate transportation equipment needs, establish marketing plans, and formulate strategies for negotiating prices and service with railroads. Port authorities need forecasts of rail grain transportation for port utilization monitoring and port expansion planning.

1.2.2 Problem and Challenges

Time series prediction relies on time series analysis, which decomposes the time series into its constituent properties and quantifies each of them. These properties include seasonality, non-periodic cycles, trend and irregular components [11].

To handle the trend of a time series, the traditional method is to study the autocorrelation and the partial autocorrelation and to remove the non-stationary trend in order to make the series stationary. Homogeneous non-stationary time series can be fixed by differencing, while other kinds of complex non-stationarity may require new methods.

The irregular component describes random and irregular influences in the time series. This component may be decomposed and described using statistical analysis. For instance, in dynamic linear models, the observation and the hidden mean are assumed to follow certain distributions.

In service industries, many time series have strong periodicity. The seasonality issue can usually be solved satisfactorily by de-seasonalization, after which the time series becomes stationary and many classical methods can be applied to it. Complex seasonal time series may exhibit multiple seasonal periods, high-frequency seasonality, non-integer seasonality, and dual-calendar effects, as shown in Figure 1.1 [12].


Figure 1.1: Examples of complex seasonality showing (a) non-integer seasonal periods,
(b) multiple nested seasonal periods, and (c) multiple non-nested and non-integer
seasonal periods

5
The seasonality can be found either by using the fast Fourier transform to analyze the frequency components or by inspecting the autocorrelation function (ACF) plot to find the lags with significant correlation.
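As an illustration, both routes can be automated. The following Python sketch (not part of the dissertation's implementation; the helper name and defaults are hypothetical) estimates a candidate period from the FFT power spectrum and from the largest ACF value at a positive lag, assuming a uniformly sampled series y:

import numpy as np

def dominant_period(y, max_lag=60):
    # Estimate a seasonal period two ways: FFT peak and ACF peak.
    y = np.asarray(y, dtype=float) - np.mean(y)
    # Frequency domain: the strongest nonzero frequency component.
    spec = np.abs(np.fft.rfft(y)) ** 2
    freqs = np.fft.rfftfreq(len(y), d=1.0)
    f_star = freqs[1:][np.argmax(spec[1:])]          # skip the DC term
    period_fft = 1.0 / f_star
    # Time domain: the positive lag with the largest sample autocorrelation.
    acf = np.correlate(y, y, mode="full")[len(y) - 1:]
    acf = acf / acf[0]
    period_acf = int(np.argmax(acf[1:max_lag + 1])) + 1
    return period_fft, period_acf

For a series with a single dominant seasonal period, the two estimates should roughly agree.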

In service industries, another challenge of time series prediction is the low-count time series pattern [13]. In low-count time series, the counts in any given period are sufficiently small that it may be unrealistic to forecast them with conventional models, including ARIMA, that are based on the normal distribution. Yelland proposed the use of dynamic linear models (DLMs) to solve this type of problem. Figure 1.2 shows an example of a low-count time series.

Figure 1.2: An example of low-count time series which are sample inventory demands

One of the challenges of the DLM is determining the signal-to-noise ratio of the time series, which may require prior experience with the data or a separate analysis. This imposes inconvenience on practitioners.

This research presents a development of the dynamic linear model with applications to demand forecasting in service industries. In light of the above discussion, the proposed method provides a framework that makes the dynamic linear model ready for prediction as soon as data are available.
1.2.3 Pattern-Based Online Prediction of Semi-periodic and Nonstationary Time Series

Background. In radiation therapy, it is important to deliver a sufficient dose to the tumor while reducing the damage to normal body tissues. To achieve this, the respiratory motion during radiotherapy has to be accounted for. Currently, there are several methods to account for respiratory motion [5]:

1. Motion-encompassing methods

2. Respiratory gating methods

3. Breath-hold methods

4. Forced shallow breathing with abdominal compression

5. Real-time tumor-tracking methods

Motion-encompassing methods estimate the mean position and range of motion during CT imaging. Respiratory gating involves the administration of radiation (during both imaging and treatment delivery) within a particular portion of the patient's breathing cycle, commonly referred to as the gate. Breath-hold methods control the tumor position for radiotherapy; for breast cancer, during inhalation the diaphragm pulls the heart away from the breast, so there is a potential reduction of both cardiac and lung toxicity. Forced shallow breathing with abdominal compression applies pressure to the abdomen to reduce diaphragmatic excursions while still permitting limited normal respiration. Real-time tumor tracking can in principle be achieved by using an MLC or a linear accelerator attached to a robotic arm or, alternatively, by aligning the tumor to the beam via couch motion [5]. To succeed, real-time tumor-tracking methods should be able to do four things: (1) identify the tumor position in real time; (2) anticipate the tumor motion to allow for time delays in the response of the beam-positioning system; (3) reposition the beam; and (4) adapt the dosimetry to allow for changing lung volume and critical structure locations during the breathing cycle [5]. In this dissertation, the prediction of tumor position is studied. One way to predict the position is through the prediction of respiratory motion, which is a semi-periodic time series. In the example of a tumor located at the superior segment of the right lung (circled in Figure 1.3), respiration is the dominant source of the tumor motion, but other sources such as cardiac motion may also be present in the time series [14].

Figure 1.3: A computer-simulated lung

The method proposed in this dissertation is designed for any time series that shows the characteristics of a semi-periodic time series. Other popular examples are ATM cash demands and geophysical data, such as sea level, sea temperature and seismic activity.

Problem and Challenges. Semi-periodic (also called quasi-periodic or quasiharmonic) time series refer to signals that are virtually periodic yet demonstrate both microscopic and macroscopic variations. Semi-periodic time series are characterized by drifts in mean position, frequency and phase, and these drifts can be considered to occur at random. Figure 1.4 shows the respiratory motion time series of lung tumor patients. The respiratory motion patterns of patients demonstrate high individuality, which is one of the challenges: to develop an application for tumor position prediction, the method must be robust enough to give satisfactory results for all patients.

Figure 1.4: Some examples of respiratory motion time series

Another challenge is to make full use of the historical data. Most current state-of-the-art prediction methods only consider local trends and are unable to take the whole time series into account [15, 16, 17], discarding a lot of important information.

Therefore, in this dissertation, a pattern-based online prediction method is proposed that uses pattern recognition techniques to conquer the issue of individuality and to make full use of all the available respiratory records. The prediction method searches for similar patterns in the history and then uses the information of these best-matching patterns for prediction. There are two major challenges: 1) to find the best neighbors which are the most relevant to the prediction problem, and 2) to maximally use the information of the obtained best neighbors.

In this study, we propose a weighted pattern-based variant best-neighbors prediction method using orthogonal polynomial approximations. This approach is able to deliver satisfactory prediction results for semi-periodic time series; a minimal sketch of the underlying idea follows.
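To make the idea concrete, the following Python sketch implements the simplest form of best-neighbors prediction: match the most recent window against all earlier segments and average the values that followed the k closest matches. This is an illustrative skeleton only; the method developed in Chapter 3 adds orthogonal polynomial approximation, pattern screening and bootstrapping on top of it, and the function name and parameter values here are hypothetical.

import numpy as np

def knn_pattern_forecast(series, window=40, k=5, horizon=15):
    # Find the k historical segments closest (Euclidean distance) to the
    # most recent window, then average the values observed `horizon`
    # steps after each of them.
    y = np.asarray(series, dtype=float)
    query = y[-window:]
    # Candidate segments must leave room for a horizon-step "future".
    n_candidates = len(y) - window - horizon
    dists, futures = [], []
    for s in range(n_candidates):
        seg = y[s:s + window]
        dists.append(np.linalg.norm(seg - query))
        futures.append(y[s + window + horizon - 1])
    best = np.argsort(dists)[:k]
    return float(np.mean(np.asarray(futures)[best]))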

1.3 Outline of the Dissertation

This dissertation focuses on methodologies for addressing the two problems introduced in Section 1.2, which solve prediction problems in healthcare and service industries. The problems involve both stationary and nonstationary time series. Figure 1.5 shows an outline of the dissertation.

Chapter 2 presents the details of the application of ARIMA and the dynamic linear model to stationary time series prediction problems for the healthcare and railroad industries. ARIMA and DLM represent two different ways to explain and model time series. The mechanisms of how they work and the limitations of the algorithms are discussed in the chapter.

Chapter 3 is dedicated to a prediction problem on semi-periodic time series. Respiratory motion time series, one kind of semi-periodic time series, is selected for study in this chapter. Variant k-Best-Neighbors is used as the core method for time series pattern matching. At the end of the chapter, the proposed method is compared to three state-of-the-art methods in respiratory motion prediction as well as to seasonal ARIMA.

Figure 1.5: Outline of the dissertation

CHAPTER 2

ARIMA and Dynamic Linear Model for Time Series Forecasting

2.1 Literature Review

Due to the advance of information technology, there are more and more ways to collect time series data. For example, consumer devices such as mobile phones and laptop computers collect data and upload them to the Internet; sensors such as GPS and RFID record positions with time stamps; machines such as glass-making machines record quality measurements of products; and medical equipment such as EEG and ECG records vital signals of patients. Time series data grow not only in length but also in breadth, so that more and more big data become available. The increasing availability of time series data empowers us to obtain more knowledge via techniques of statistics and data mining.

Tasks of time series data mining include indexing, clustering, classification, forecasting and anomaly detection [18]. Indexing assigns indices to a query time series to represent its similarity to a class; prediction and other analyses can then be done using this similarity information. Clustering separates time series into groups based on available independent variables such that, within each group, the time series show similar properties. Classification assigns time series to predefined classes. Forecasting models the underlying system and predicts future values. Anomaly detection finds abnormality in a time series by comparing it to a benchmark of normal time series. This study focuses on the forecasting problem of time series data, which is a critical concern in many applications, such as weather forecasting, natural disaster forewarning, and the prediction of epidemics and stock crashes [18].
Traditional time series data usually have relatively low dimensionality. As data become massive in volume, traditional statistics and data mining techniques are no longer able to cope. Also, due to the high non-stationarity and large amounts of noise that may be present in available time series data, traditional time series analysis tools such as ARIMA, which assume stationarity, may no longer be suitable [18]. So it is necessary to find a way to overcome the limitations of the traditional approaches and uncover complex, hidden patterns in massive non-stationary time series data.

Forecasting can be done by empirical qualitative analysis or by mathematical quantitative analysis; from this perspective, forecasting methods can be broadly classified as qualitative and quantitative. Empirical qualitative analysis, based on expertise, experience and intuition, is useful when historical data are unavailable or irrelevant due to rapid changes in circumstances [18]. Quantitative methods can be further classified into causal and non-causal methods [1, 18]. Causal methods include linear regression, econometric models and artificial neural network (ANN) models, where predictions are made based on data of relevant influential factors. Non-causal methods include moving average [19, 20, 1, 21, 18], exponential smoothing [1, 18], Box-Jenkins [19, 20, 1, 21, 18], state space [1, 18, 22] and spectral analysis [1, 18]. More details of quantitative methods are given in Figure 2.1.

Quantitative methods usually analyze certain characteristics of the time series for prediction, e.g., trend, seasonality, cycles and randomness. The trend tells us whether the series is increasing or decreasing and whether it is linear or non-linear. The seasonality tells us whether a pattern repeats at a fixed time interval. Cycles are very common in time series data; their patterns may repeat at varying time intervals. Randomness makes patterns difficult to identify, and it is desirable to separate the randomness from the systematic patterns [18].
Figure 2.1: Categorization of quantitative forecasting models

To evaluate the performance of forecasting models, mean squared error (MSE) and its variants, such as root mean squared error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE), are commonly used. Another popular performance measure is R-square, which represents the proportion of variability in a data set that can be explained by the forecasting model. For model selection problems, the Akaike information criterion (AIC) is often used, which penalizes model complexity to discourage overfitting. However, AIC is not consistent. The Bayesian information criterion (BIC) and the Hannan-Quinn criterion (HQC) are popular alternatives to AIC and are consistent; BIC also generally penalizes free parameters more strongly than AIC does. Cross validation is another model selection method: data are divided into two sets, one for training and the other for validation, and, to account for the variation in the data, many different training and validation sets are used. Besides best subset selection, stepwise model selection, such as forward selection and backward selection, is often used to find the best model.
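For concreteness, these accuracy measures can be computed as in the following Python sketch (an illustrative helper, not part of any method proposed here):

import numpy as np

def forecast_metrics(y, f):
    # Common accuracy measures for forecasts f of observations y.
    y, f = np.asarray(y, float), np.asarray(f, float)
    e = y - f
    mse = np.mean(e ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(e)),
        "MAPE": 100.0 * np.mean(np.abs(e / y)),   # assumes y has no zeros
        "R2": 1.0 - np.sum(e ** 2) / np.sum((y - y.mean()) ** 2),
    }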

2.2 ARIMA and Dynamic Linear Model

ARIMA methods are the most popular tools for time series forecasting and have been applied in many different applications, such as tourism forecasting [8, 6], where seasonal ARIMA is applied to determine the size of the flows of tourism demand in Montenegro. Recently, dynamic linear model (DLM) forecasting methods have been developed, which are shown to be advantageous in some cases, especially short-term forecasting [6, 23]. In this section, the basics and forecasting procedures of these two methods are first reviewed, and then the issues in using them in practice are discussed.

2.2.1 ARIMA

Basics Of ARIMA. ARIMA stands for Autoregressive Integrated Moving Average; it models a time series through these three components. Hence, autoregressive models, moving average models (e.g., SES, EWMA) and random-walk models with or without trend are special cases of ARIMA models. For an autoregressive model of order p (i.e., AR(p)), the current value depends on previous values plus the current error term,

$$z_t = \phi_1 z_{t-1} + \cdots + \phi_p z_{t-p} + a_t \qquad (2.1)$$

where $z_t$ is the observation at time $t$, $\phi_i$ is the regression coefficient of the observation at lag $i$, and $a_t$ is the error term at time $t$. The backshift operator $B$ is introduced to simplify the formula, where

$$z_{t-p} = B^p z_t$$

Therefore, equation 2.1 can be rewritten as

$$(1 - \phi_1 B - \cdots - \phi_p B^p)\, z_t = a_t \qquad (2.2)$$

or

$$\phi(B)\, z_t = a_t \qquad (2.3)$$

For a moving average model of order q (i.e., MA(q)), the current value depends on the current and previous error terms,

$$z_t - \mu = a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q} \qquad (2.4)$$

where $\theta_j$ is the regression coefficient of the error term at lag $j$. Using the backshift operator notation, this can be expressed as

$$z_t - \mu = \theta(B)\, a_t \qquad (2.5)$$

Figure 2.2: Examples of stationary and homogeneous nonstationary time series

Consequently, the ARMA(p, q) process can be written as

$$\phi(B)(z_t - \mu) = \theta(B)\, a_t \qquad (2.6)$$

This model assumes that the underlying process is stationary, which means that the mean and variance are constant and the autocovariances depend only on the time lag. Figure 2.2a shows a typical example of a stationary time series. In contrast, Figure 2.2b shows an example of a homogeneous nonstationary time series, which resembles a random walk. The definition of stationarity can be expressed as follows:

• $E(y_t) = \mu$ for all $t$
• $\mathrm{Var}(y_t) = E[(y_t - \mu)^2] = \sigma^2$ for all $t$
• $\mathrm{Cov}(y_t, y_{t-k}) = \gamma_k$ for all $t$, depending only on the lag $k$

Box and Jenkins [20] point out that homogeneous nonstationary sequences like the data in Figure 2.2b can be transformed into stationary sequences by taking successive differences of the series [19].
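A quick numerical illustration of this point in Python: a random walk, the canonical homogeneous nonstationary series, becomes stationary after a single difference (the seed and length below are arbitrary).

import numpy as np

rng = np.random.default_rng(0)
walk = np.cumsum(rng.normal(size=500))   # random walk: homogeneous nonstationary
diffed = np.diff(walk)                   # first difference recovers stationary steps

# The level of the walk drifts between halves; the differenced series does not.
print(walk[:250].mean(), walk[250:].mean())      # typically far apart
print(diffed[:250].mean(), diffed[250:].mean())  # both near zero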

Similar to many earlier methods, such as the Holt-Winters method of exponential smoothing, ARIMA is able to model the seasonality of a time series. The multiplicative seasonal autoregressive integrated moving average model of order $(p, d, q)(P, D, Q)_s$ is

$$\phi(B)\,\Phi(B^s)\,(1 - B)^d (1 - B^s)^D z_t = \theta(B)\,\Theta(B^s)\, a_t \qquad (2.7)$$

where $B^s$ is the seasonal backshift operator.
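In software, the multiplicative seasonal model maps directly onto the order and seasonal-order arguments of standard ARIMA routines. The following sketch, which assumes the statsmodels package and an artificial series with period s = 7, is illustrative only:

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Toy weekly-seasonal series; order=(p, d, q), seasonal_order=(P, D, Q, s).
rng = np.random.default_rng(1)
t = np.arange(300)
y = 10 * np.sin(2 * np.pi * t / 7) + rng.normal(scale=2.0, size=t.size)

model = SARIMAX(y, order=(1, 0, 1), seasonal_order=(1, 1, 1, 7))
res = model.fit(disp=False)
print(res.forecast(steps=7))   # forecasts for the next seasonal cycle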

ARIMA models assume that the terms in the time series have linear relationships and that the residuals follow a normal or t distribution with constant mean and variance. Box and Jenkins [20] suggest a three-stage model building approach:

• Model Specification

• Model Estimation

• Diagnostic Checking

MODEL SPECIFICATION

The following rules are typically used in building ARIMA models [21]:

• Differencing (I)

– Rule 1: If the series has positive autocorrelations to a high number of lags,

then it probably needs a higher order of differencing.

– Rule 2: If the lag-1 autocorrelation is zero or negative, or the autocorrela-

tions are all small and patternless, then the series does not need a higher

order of differencing. If the lag-1 autocorrelation is -0.5 or more negative,

the series may be overdifferenced.

– Rule 3: The optimal order of differencing is often the order of differencing

at which the standard deviation is lowest.

– Rule 4: A model with no orders of differencing assumes that the original

series is stationary (mean-reverting). A model with one order of differ-

encing assumes that the original series has a constant average trend (e.g.

a random walk or SES-type model, with or without growth). A model

with two orders of total differencing assumes that the original series has a

time-varying trend (e.g. a random trend or LES-type model).

– Rule 5: A model with no orders of differencing normally includes a constant

term (which represents the mean of the series). A model with two orders

of total differencing normally does not include a constant term. In a model

with one order of total differencing, a constant term should be included if

the series has a non-zero average trend.

• AutoRegressive (AR)

– Rule 6: If the PACF (partial autocorrelation function) of the differenced series displays a sharp cutoff and/or the lag-1 autocorrelation is positive, i.e., if the series appears slightly "underdifferenced", then consider adding an AR term to the model. The lag at which the PACF cuts off is the indicated number of AR terms.

• Moving Average (MA)

– Rule 7: If the ACF (autocorrelation function) of the differenced series displays a sharp cutoff and/or the lag-1 autocorrelation is negative, i.e., if the series appears slightly "overdifferenced", then consider adding an MA term to the model. The lag at which the ACF cuts off is the indicated number of MA terms.

• AR and MA

– Rule 8: It is possible for an AR term and an MA term to cancel each other's effects, so if a mixed AR-MA model seems to fit the data, also try a model with one fewer AR term and one fewer MA term, particularly if the parameter estimates in the original model require more than 10 iterations to converge.

• Unit Root

– Rule 9: If there is a unit root in the AR part of the model, i.e., if the sum of the AR coefficients is almost exactly 1, you should reduce the number of AR terms by one and increase the order of differencing by one.

– Rule 10: If there is a unit root in the MA part of the model, i.e., if the sum of the MA coefficients is almost exactly 1, you should reduce the number of MA terms by one and reduce the order of differencing by one.

– Rule 11: If the long-term forecasts appear erratic or unstable, there may be a unit root in the AR or MA coefficients.
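These hand rules can be complemented by a mechanical search over candidate orders scored by AIC (Section 2.1). The following Python sketch, assuming statsmodels and a hypothetical helper name, illustrates the idea; it complements rather than replaces the diagnostic checking described below.

import itertools
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def select_order(y, max_p=3, max_q=3, d=1):
    # Brute-force (p, d, q) selection by AIC over a small grid.
    best_aic, best_order = np.inf, None
    for p, q in itertools.product(range(max_p + 1), range(max_q + 1)):
        try:
            aic = ARIMA(y, order=(p, d, q)).fit().aic
        except Exception:
            continue                      # some orders fail to converge
        if aic < best_aic:
            best_aic, best_order = aic, (p, d, q)
    return best_order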

MODEL ESTIMATION

The ARIMA model can be written as

$$a_t = \theta_1 a_{t-1} + \cdots + \theta_q a_{t-q} + z_t - \phi_1 z_{t-1} - \cdots - \phi_p z_{t-p} \qquad (2.8)$$

By assuming that the error terms are independently and identically distributed according to a normal distribution with mean zero and variance $\sigma^2$, we can use maximum likelihood estimation to estimate $\theta$ and $\phi$. The loss function can be derived according to the specified model.

DIAGNOSTIC CHECKING

As mentioned before, the basic assumption of ARIMA is that the error terms are independently distributed according to a normal distribution with mean zero and variance $\sigma^2$. For diagnostic checking, we need to check whether the mean of the residuals is close to zero, check the residual plot to see whether the variance is constant, and check the autocorrelation plot to see whether there is any violation of the assumption of zero autocorrelation. The sample autocorrelations can be calculated as

$$r_{\hat a}(k) = \frac{\sum_{t=k+1}^{n} (\hat a_t - \bar{\hat a})(\hat a_{t-k} - \bar{\hat a})}{\sum_{t=1}^{n} (\hat a_t - \bar{\hat a})^2} \qquad (2.9)$$
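Equation (2.9) translates directly into code. The following Python sketch (an illustrative helper) computes the residual autocorrelations; for an adequate model, the residual mean should be near zero and the autocorrelations should fall roughly within ±2/√n.

import numpy as np

def residual_acf(a_hat, max_lag=20):
    # Sample autocorrelations of the residuals, as in equation (2.9).
    a = np.asarray(a_hat, float)
    a_bar = a.mean()
    denom = np.sum((a - a_bar) ** 2)
    return np.array([
        np.sum((a[k:] - a_bar) * (a[:-k] - a_bar)) / denom
        for k in range(1, max_lag + 1)
    ])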

Forecasting Procedure Based On ARIMA. After obtaining a time series model, the forecast of $z_{n+l}$ can be obtained by minimum mean square error (MMSE) estimation. For example, for a seasonal model of order $(0,1,1)(0,1,1)_7$, the forecast of $z_{n+l}$ is the following conditional expectation:

$$z_{n+l} = z_{n+l-1} + z_{n+l-7} - z_{n+l-8} + a_{n+l} - \theta a_{n+l-1} - \Theta a_{n+l-7} + \theta\Theta a_{n+l-8} \qquad (2.10)$$

Then, the forecasts satisfy

$$(1 - B)(1 - B^7)\, z_n(l) = 0, \qquad l > 8 \qquad (2.11)$$

Therefore, the difference equation has the solution

$$z_n(l) = z_n(r, m) = \beta_{0m}^{(n)} + \beta_{1*}^{(n)} r \qquad (2.12)$$

The forecast function is described by the 7-time-unit levels $\beta_{0m}^{(n)}$ and a coefficient $\beta_{1*}^{(n)}$ for the trend change. Represented in autoregressive form, the forecasts can also be interpreted in terms of exponentially weighted averages [22].

The above is 1-step-ahead forecasting. In some cases, k-step-ahead forecasting may be desired. The mechanisms of these two forecasting schemes are illustrated in Figures 2.3 and 2.4, and the steps involved are summarized below.

1-STEP FORECASTING

Let $t$ be the time and $y_t$ be the value at time $t$:

Figure 2.3: Illustration of 1-step forecasting

1. Train a model using the training data set $T = [1, 2, \ldots, a-1]$.

2. Forecast only the next value (one-step forecasting) based on the trained model. The forecasting set is $F = [a]$.

3. Repeat the process by adding the newly observed data point to the training set and moving the forecasting period forward by 1 step, i.e., $T = [1, 2, \ldots, a]$ and $F = [a + 1]$.

4. To evaluate the performance of forecasting, calculate the forecasting errors by comparing the forecasts to the observed data.

Figure 2.4: Illustration of k-step forecasting

k-STEP FORECASTING

1. Train a model using the training data set $T = [1, 2, \ldots, a-1]$.

2. Forecast values over a fixed window $\tau$ (multiple-step forecasting) starting at $t_a$. The forecasting set is $F = [a, a + 1, \ldots, a + \tau]$.

3. Repeat the process by adding the newly observed data point to the training set and moving the forecasting period forward by 1 step, i.e., $T = [1, 2, \ldots, a]$ and $F = [a + 1, a + 2, \ldots, a + \tau + 1]$.

4. To evaluate the performance of forecasting, calculate the forecasting errors by comparing the forecasts to the observed data. A rolling-origin sketch of both schemes is given below.
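Both schemes are instances of rolling-origin evaluation, sketched below in Python under the assumption of a statsmodels ARIMA model; the function name and default order are hypothetical.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def rolling_forecast(y, n_train, steps=1, order=(1, 1, 1)):
    # Refit on y[0:a], forecast `steps` ahead, then slide the origin
    # forward by one observation. steps=1 reproduces the 1-step scheme;
    # steps=k reproduces the k-step scheme.
    forecasts = []
    for a in range(n_train, len(y) - steps + 1):
        res = ARIMA(y[:a], order=order).fit()
        forecasts.append(res.forecast(steps=steps))
    return np.asarray(forecasts)   # row i: forecasts made at origin n_train + i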

Issues Of ARIMA Forecasting. ARIMA is a general time series analysis tool. Under the ARIMA framework, homogeneous nonstationary time series can be transformed to stationary time series by differencing or taking logarithms. Time series often show autocorrelation, meaning that a large (small) previous value tends to be followed by a large (small) future value; the AR process is able to model this phenomenon, and taking account of previous values produces more accurate predictions of future values. Due to this generality, ARIMA models have gained great popularity in time series analysis. However, there are three issues in the use of these methods in practice:

1. Sensitivity to model specification: To apply ARIMA, we must first do the model specification. As shown in Section 2.2.1, this process typically involves a lot of personal judgment in determining the order of the AR and MA components of the model, and it requires experience to specify an appropriate model. Moreover, to find a good model, a trial-and-error strategy may be followed, which increases the computational time.

2. Requirement of a large amount of historical data: A large amount of historical data is needed to build an ARIMA model. For instance, a model of order $(p, d, q)(P, D, Q)_s$ has $p + q + P + Q$ coefficients and consumes $d + sD$ observations in differencing, so at least $p + q + P + Q + d + sD + 1$ observations are required to estimate the parameters. As a result, forecasting cannot be done at the beginning of the process, but has to start after adequate historical data are available.

3. Assumption of a deterministic mean: ARIMA assumes the underlying process is deterministic; that is, the underlying mean is either a constant or has a homogeneous trend that can be transformed away. After transformation or differencing, the mean of the differenced time series should be a constant. However, this may not always hold in practice. Due to the existence of random factors, nonstationary time series are very common, especially for data collected over short periods. For such time series, the forecasting performance of ARIMA models may not be satisfactory.

2.2.2 Dynamic Linear Models

Basics Of Dynamic Linear Models. The dynamic linear model is a special type of state space model for time series data. It is a hierarchical model with two levels: the mean model, which represents the evolution of the mean via state space transitions, and the observation model, which models the observed values by taking into account the mean evolution and observational errors. Figure 2.5 shows the structure of the DLM. The basic elements of this model are as follows.

Observation model:

$$y_t = \mu_t + v_t, \qquad v_t \sim N(0, V) \qquad (2.13)$$

Mean model:

$$\mu_t = \mu_{t-1} + w_t, \qquad w_t \sim N(0, RV) \qquad (2.14)$$

where $\mu_t$ represents the hidden mean at time $t$, $y_t$ represents the observation at time $t$, $v_t$ represents the observational error and $w_t$ represents the mean evolution. Note that the variance of the mean errors is $R$ times the variance of the observations, where $R$ is the signal-to-noise ratio, also called the drift parameter. This model has three parameters: the initial mean $\mu_0$, the variance of observations $V$, and the drift parameter $R$. Usually $R$ is assumed to be known, while the other two parameters are unknown and need to be estimated. This model is a basic DLM of first order with constant variance $V$.

Figure 2.5: Structure of the dynamic linear model

Typically the DLM is estimated using Bayesian methods [22]. The prior specification, the resulting posteriors and the formulas for forecasting are as follows.

Initial priors:

$$\mu_0 \sim N(m_0, C_0 V) \qquad (2.15)$$

$$\phi \sim \mathrm{Gamma}\!\left(\frac{n_0}{2}, \frac{d_0}{2}\right) \qquad (2.16)$$

Posteriors:

$$(\mu_t \mid y_1, \ldots, y_t) \sim N(m_t, C_t V) \qquad (2.17)$$

$$(\phi \mid y_1, \ldots, y_t) \sim \mathrm{Gamma}\!\left(\frac{n_t}{2}, \frac{d_t}{2}\right) \qquad (2.18)$$

Updating recurrence relationships:

$$C_t = \frac{C_{t-1} + R}{C_{t-1} + R + 1} \qquad (2.19)$$

$$m_t = m_{t-1} + C_t (y_t - m_{t-1}) \qquad (2.20)$$

$$n_t = n_{t-1} + 1 \qquad (2.21)$$

$$d_t = d_{t-1} + \frac{(y_t - m_{t-1})^2}{C_{t-1} + R + 1} \qquad (2.22)$$

Forecasting:

$$f_t = \bar{y} = m_{t-1}$$

The initial prior for the mean is a normal distribution, and the prior for the precision of the mean ($\phi = 1/V$) is a Gamma distribution. These are the conjugate priors of the model; that is, the resulting posterior distributions have the same form as the priors, except that the parameters are updated by equations 2.19-2.22. In this framework, the forecast of the observation at time $t$ is equal to the posterior mean at time $t-1$.
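The updating recurrences (2.19)-(2.22) and the forecast $f_t = m_{t-1}$ can be implemented in a few lines, as in the following Python sketch of this first-order DLM filter (the initial values other than m0 are placeholders, to be set from historical data as described next):

import numpy as np

def dlm_filter(y, R, m0, C0=1.0, n0=1.0, d0=1.0):
    # One pass of the first-order DLM recurrences (2.19)-(2.22),
    # returning the one-step forecasts f_t = m_{t-1}.
    m, C, n, d = m0, C0, n0, d0
    forecasts = []
    for yt in np.asarray(y, dtype=float):
        forecasts.append(m)                      # f_t = m_{t-1}
        C_new = (C + R) / (C + R + 1.0)          # (2.19)
        d = d + (yt - m) ** 2 / (C + R + 1.0)    # (2.22), uses C_{t-1}
        m = m + C_new * (yt - m)                 # (2.20)
        n = n + 1.0                              # (2.21)
        C = C_new
    return np.asarray(forecasts)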

To apply the DLM, the starting mean $\mu_0$, the variance of observations $V$ and the drift parameter (signal-to-noise ratio) $R$ have to be estimated or specified. $\mu_0$ and $V$ are estimated using historical data, while $R$ is conventionally specified by users.

Forecasting Procedure Based On DLM. The forecasting procedure based on the DLM has the following steps.

Step 1: Before performing forecasting, the seasonality, if any, should be removed by differencing the data,

$$\hat{y}_t = (1 - B^S)\, y_t \qquad (2.23)$$

Step 2: Estimate the parameters of the initial priors using the historical data. Suppose that the historical dataset $y^H$ contains $H$ observations. Then $m_0$, $C_0$, $n_0$ and $d_0$ in the initial priors can be estimated as follows:

$$m_0 = \bar{y}^H \qquad (2.24)$$

$$C_0 = 1 \qquad (2.25)$$

$$n_0 = H \qquad (2.26)$$

$$d_0 = n_0 \cdot \mathrm{var}(y^H) \qquad (2.27)$$

Here $m_0$ is the estimate of the initial mean $\mu_0$; $C_0$ is the estimate of the initial ratio of the mean variance to the observational variance (the initial signal-to-noise ratio); and $n_0$ and $d_0$ are the degrees of freedom and the scale parameter of the Gamma distribution.

Step 3: The forecast value $f_t$ can be obtained from the updating equations (2.19)-(2.22) and the forecasting equation $f_t = m_{t-1}$.

Issues Of DLM Forecasting. The DLM contains an important parameter, the signal-to-noise ratio $R$, which is the ratio of the mean variance to the observational variance. This parameter is typically unknown and needs to be specified by users. However, there is no obvious way to determine its value, and users have to guess a value based on their experience. This brings some inconvenience to the application of this method in practice and may also degrade its performance when the signal-to-noise ratio is not specified appropriately.

2.3 The Proposed Method

For dynamic linear models, the forecasting procedure starts with inputting the initial priors and the signal-to-noise ratio $R$. To determine the initial mean $\mu_0$, we may use the information in the historical dataset. The variance of observations, $V$, is assumed to be constant and unknown; it can be estimated through $n_t$ and $d_t$ in equations (2.18), (2.21) and (2.22). Finally, we need to specify the value of the signal-to-noise ratio $R$; then all parameters are set and we are able to forecast. To make the specification of $R$ automatic, we propose a least squares estimation method that finds the estimate of $R$ minimizing the mean squared error of forecasting.

Figure 2.6: The proposed forecasting method with R estimates

Figure 2.6 illustrates the schema of the new forecasting procedure using the R estimates. The left side of the diagram shows that the initial mean is estimated from historical data as

m0 = (1/nh ) Σ_{i=1}^{nh} Yi

where nh is the number of historical data points. The right side shows the forecasting procedure with the updated R estimate. The new data, shown as blue squares, are obtained sequentially. Each time a new observation arrives, we forecast the next value using the value of R automatically estimated by the proposed method. nk is the total number of forecast values. The forecasting procedure of the proposed method can be summarized as follows:

Step 1 : Use historical data to estimate µ0 .

Step 2 : Assume the initial prior C0 = 1 and use the proposed method to estimate the signal-to-noise ratio R:

R̂ = arg min_r Σ_{i=nk+1}^{nk+nf} (fi (r) − Yi )²

Step 3 : Update Ct in equation (2.19) using the R estimate.

Step 4 : Calculate the forecast ft = mt−1 using the updated Ct and equation (2.20).

Note that initially we assume the variance of the hidden mean equals the observational variance by setting C0 = 1. As more data arrive, Ct is updated and the ratio becomes more and more accurate.
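The following is a minimal sketch of this least-squares idea, reusing the dlm_filter recursion sketched earlier; the grid of candidate values and the prior settings (n0 = 1, d0 = var(y)) are illustrative assumptions, not prescribed choices.

```python
import numpy as np

def estimate_R(y, m0, grid=np.logspace(-4, 0, 50)):
    """Least-squares estimate of R: choose the candidate value that
    minimizes the one-step-ahead forecast MSE, a simple stand-in for
    arg min_r sum_i (f_i(r) - Y_i)^2."""
    best_r, best_mse = grid[0], np.inf
    for r in grid:
        f = dlm_filter(y, m0=m0, C0=1.0, n0=1.0, d0=np.var(y), R=r)
        mse = np.mean((f - y) ** 2)
        if mse < best_mse:
            best_r, best_mse = r, mse
    return best_r
```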

2.4 Numerical Study

A comprehensive numerical study is done to evaluate the performance of the

proposed forecasting procedure described in Section 2.3 and to compare the proposed

method to ARIMA and DLM methods described in Section 2.2. The scenario design

and computation procedure in the simulations will be presented in Section 2.4.1,

specific concerns to be addressed in this study will be given in Section 2.4.2, results

of simulations will be shown in Section 2.4.3, and our findings will be summarized in

Section 2.4.4.

2.4.1 Scenario Design and Computation Procedure

Some preliminary studies are first done to determine the ranges of the param-

eters V , nh , nf , R and rin such that patterns in the forecasting performance, if any,

can be captured. These five parameters are defined as follows in the simulation:

nh = [20 100 300 500 700];

nf = [20 100 300 500 700];

R = [0.001 0.01 0.02 0.05 0.08];

rin = [0.001 0.01 0.02 0.05 0.08 Rest ];


V = [10 20 30 40 50];

where nh is the number of historical data, nf is the number of forecasts, R is the true value of the signal-to-noise ratio, rin is the specified value of R or the R estimate Rest obtained by the proposed method, and V is the observation variance.

For instance, because the historical data is only used for estimating the initial mean, its range is selected to reflect the effect of the accuracy of mean estimation. Also, small values of rin are of interest in this study because the signal-to-noise ratio is typically very small in practice. Five fixed values are considered for each parameter to cover

three levels (small, medium and large). For rin , in order to compare the scenario with a specified r and the scenario with r updated by the proposed method, five fixed values of r are considered along with the R estimate Rest . To find out the effect of each parameter, we change one parameter at a time and fix the others at typical levels. Details of the parameter settings are displayed in Table 4.1.

In the simulation study, 1000 time series are generated under each parameter

setting. For each time series, the data are generated through the following steps:

Step 1 : Specify parameters V, nh , nf , R and C0 as shown earlier in this section

Step 2 : Generate a random number µ0 , where µ0 ∼ N (m0 , C0 ) and m0 is set

to 0

Step 3 : Generate a random number v, where v ∼ N (0, V )

Step 4 : Generate a random number w, where w ∼ N (0, RV )

Step 5 : Calculate the hidden mean, µi by µi = µi−1 + wi

Step 6 : Calculate the observed value, Yi by Yi = µi + vi

Repeat Step 3 to Step 6 until i = nh + nf

The first nh observations are used as historical data, while the rest are used in forecasting.
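A minimal Python sketch of this generation scheme, assuming m0 = 0 as in Step 2; the function name is illustrative.

```python
import numpy as np

def simulate_local_level(nh, nf, V, R, m0=0.0, C0=1.0, seed=None):
    """Generate one series following Steps 1-6: a random-walk hidden
    mean mu_i = mu_{i-1} + w_i with w_i ~ N(0, R*V), observed as
    Y_i = mu_i + v_i with v_i ~ N(0, V)."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(m0, np.sqrt(C0))              # Step 2
    y = np.empty(nh + nf)
    for i in range(nh + nf):
        mu += rng.normal(0.0, np.sqrt(R * V))     # Steps 4-5
        y[i] = mu + rng.normal(0.0, np.sqrt(V))   # Steps 3, 6
    return y[:nh], y[nh:]                         # historical / forecasting parts
```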

2.4.2 Concerns To Address

Our simulation aims to address the following questions:

Question 1: What is the performance of the updating-R procedure on R estimation? Histograms are plotted to visualize the distribution of the R estimate under various parameter settings. Moreover, to evaluate whether the procedure is able to accurately estimate the true R, the ratio of the R estimate to the true R, r/R, is used to allow a fair comparison. That is, r/R will be close to 1 if the estimation is good.

Question 2: What is the performance of ARIMA, DLM and DLM with the updating-R procedure, and what are their strengths and weaknesses? The root mean square error (RMSE) is used to evaluate the prediction error and to compare the performance of the three methods.

Question 3: How robust is each method to the specification of parameters, and which method is the most robust? Methods are desired to be robust, i.e., insensitive to the setting of parameters. Particularly in our study, we also consider variation of the mean, i.e., mean drift, in the time series.

Question 4: How do the parameters affect the performance of forecasting and of R estimation (only for DLM with the updating-R procedure)? Through the study of the effects of the parameters, we will be able to validate our method by evaluating whether the responses are reasonable.

2.4.3 Results

Performance Of The Estimator Of R To show the performance of the proposed estimator of R, typical levels of R are considered, in which the true R = [0.001 0.01 0.02 0.05 0.08] and the other parameters are set within the ranges of typical levels shown in Section 2.4.1, in order to include a wide range of values while obtaining analyzable time series. 2500 simulations are done and the R estimate is obtained in each simulation. Figure 2.7 shows the number of forecasts nf versus the mean and variance of the R estimates. From Figure 2.7a, we can see that the mean of the R estimates converges to the true value of R, which suggests that the proposed R estimator is unbiased. Figure 2.7b shows that the variance of the R estimate becomes smaller as nf becomes larger, which is consistent with intuition.

Effect Of Parameters Simulations are done to study the effects of the parameters V , nf and R. Similar to the previous simulation, typical levels of the parameters are set to include a wide range of possible values while obtaining analyzable time series. 2500 simulations are done and the R estimate is obtained in each simulation. The parameter settings are shown in Section 2.4.1. Mean square error (MSE) is used to measure the error of forecasting because it incorporates both the variance of Y and its bias. The ratio between the estimated R and the true R, i.e. r/R, is used to measure the deviation of the estimated R from the true R; r/R = 1 if r = R. It is useful for comparing R estimates amongst time series with different true R's.

Figure 2.8 shows that higher V gives a larger mean and variance of the error. With the same signal-to-noise ratio, if the variance of the time series V is larger, the evolution error of the time series will also be larger. The DLM models the mean of the time series; in other words, if V is larger, the range of the mean of the time series will be larger. Therefore, the MSE of forecasting will be larger.

The distributions of r/R under different V values are similar, and no obvious trend is observed by changing V . In conclusion, the value of V does not affect the estimation of R.

(a) Mean of R estimates (b) Variance of R estimates

Figure 2.7: Mean (a) and variance (b) of the R estimates versus the number of forecasts nf

Figure 2.8: Histogram of MSE of V

Figure 2.9: Histogram of r/R of V

In Figure 2.10 and Figure 2.11, the graphs from left to right are the histograms of MSE and r/R, respectively, for increasing nf . V , nh and R are set to typical values, which are 10, 100 and 0.02 respectively. Figure 2.10 shows that a higher nf gives a more precise error, but it does not help to reduce the error itself. So, in our simulation study, the error of forecasting is determined only by the range of the time series, i.e., larger fluctuations give larger errors. However, increasing the number of data for forecasting, nf , can help to increase the forecasting precision.

Figure 2.10: Histogram of MSE of nf

The distribution is skewed when nf is small and becomes more symmetric and concentrated as nf increases. Figure 2.11 shows that the estimation of R is quite precise when nf = 500. So, the forecasting precision increases when the estimation precision of R increases.
Figure 2.11: Histogram of r/R of nf

In Figure 2.12 and Figure 2.13, the graphs from left to right are the histograms of MSE and r/R, respectively, for increasing R. V , nh and nf are set to typical values, which are 10, 100 and 500 respectively. Figures 2.12 and 2.13 show that larger R gives a slightly larger error but a better R estimate. This is because, for larger R, the signal is clearer in contrast with the noise. Therefore, the estimation of R is more precise.

Figure 2.12: Histogram of MSE of R

Figure 2.13: Histogram of r/R of R

Figure 2.14: Design of experiment and the prediction performance (mean of MSE and
r/R of each set of experiment)

Forecasting Performance Of The Proposed Procedure In the simulation,

time series are generated by the following parameters: initial mean, variance, signal-

to-noise ratio, number of historical data and the number of data to be forecasted.

In order to study the effects of the parameters and to investigate the prediction

performance of the algorithms under various circumstances, five levels of values for

each parameter are carefully chosen. For each parameter, all of its levels will be tested

while other parameters are fixed to typical values so that the effects of each parameter

will be better visualized with more general time series. One-step forecasting is used so that the performances are clearly shown and compared in a simple forecasting case. To compare the proposed method to the conventional method, two ways are designed to specify the signal-to-noise ratio: using a chosen fixed R and using the updating R.

The prediction follows the trend of the time series even when there is mean drift. In Figure 2.15 (upper panel), the black line shows the observed values and the red line the forecasts. In the figure, we can see the mean drift in the observed data. Mean drift is defined as the random evolution of the hidden mean value. In our model, we assume that the increments of the hidden mean follow a normal distribution with mean zero and variance W = RV . Therefore, the model performs well on time series data even with mean drift.

Figure 2.15: Performance of the proposed procedure for forecasting

Figure 2.15 (lower panel) shows the R estimate of one simulation with parameters V = 10, nh = 100, nf = 700 and R = 0.02. Again, we see that the R estimate converges to the true value asymptotically. This means that the proposed method can accurately estimate the true R, and the model reconstructs the time series and predicts future values very well.

Updating-R Vs Fixed-R In this simulation, we use the previous experimental design again. Five levels are carefully chosen to represent various circumstances. Applying the forecasting models, the mean square errors (MSE) of each method are calculated and plotted in the same graph. The purpose of this simulation is to see how well the updating-R method performs over time versus the fixed-R method.

Figure 2.16: Performance of forecasting with Updating-R vs. Fixed-R

From the graph, the performance over time is clearly visualized. The red line in Figure 2.16 shows the MSE of the model using the R estimate. The true R is 0.02, represented by the green line. The MSE with the R estimate approaches that with the true R. If we set R to 0.05, the performance is slightly worse; if we set R to 0.001, the performance is much worse. That means that if we select a wrong R, the performance can be significantly affected, and our method helps to solve this issue.

2.4.4 Summary of Numerical Study

The results in Section 2.4.3 address the concerns stated in Section 2.4.2. DLM with the updating-R procedure always performs better than the fixed-R method. The proposed R estimator is unbiased and converges to the true value when the sample size is large. DLM with updating-R is more robust than the conventional DLM method when mean drift exists in the time series. In general, more forecasting data (nf ) gives a smaller prediction variance and a more accurate R estimate, a larger observation variance gives a larger mean and variance of the error, and a larger R gives a slightly larger error but a better R estimate.

2.5 Case Study

Six real datasets are used in this study to compare the two methods described in Section 2.2 and validate the effectiveness of the proposed method in Section 2.3. These data are shown in Figure 2.17. Data 1 to Data 3 are counts of cargo loads in the railroad industry and Data 4 to Data 6 are counts of received calls in a nurse triage call center. It is noticeable that Data 1 to Data 4 are stationary and cyclical with about 7 days per cycle, while Data 5 and Data 6 are non-cyclical but may have mean drifts.

Figure 2.18 shows the time series of Data 1 to Data 6 and their forecasting results. Data 1 to Data 4 are de-seasonalized before being input to the model, so the forecasts in Figure 2.18 capture their cyclical properties well. Data 5 and Data 6 do not show any clear seasonal pattern, so there is no straightforward way to eliminate their cyclical components. Nevertheless, from the perspective of trend modeling, we see that


Figure 2.17: The data used in the case study

the forecasts of all datasets follow the trends of the time series. To examine the differences in forecasting performance across the cases, we have done some comparisons between the ARIMA and DLM methods.

To build ARIMA models for the time series, we need to specify a model first. To do this, we look at the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots. The model building process for Data 1 is given in the following as a demonstration.

From both plots, it is obvious that the cycle is 7 days, so we specify the seasonality as 7. After specifying the seasonality, the ACF and PACF are plotted again as follows. Now, we can see from the ACF plot that there is autocorrelation at lag 7; therefore, we specify the seasonal MA (SMA) order as 1. Looking at the PACF plot, it shows autocorrelation at lags 7 and 14, and maybe more, but this is not important because

Figure 2.18: Forecasting results of the six datasets

Figure 2.19: ACF and PACF plots of original data in case 1

practically we do not specify an order higher than 2. For simplicity, we specify the seasonal AR (SAR) order as 1. Again, ACF and PACF plots are produced for the new model.

Figure 2.20: ACF and PACF plots of deseasonalized data in case 1

From both the ACF and PACF plots, we can see autocorrelation at lag 1. Therefore, the MA and AR terms are both specified as 1. By plotting the ACF and PACF again, we see that the current model (1, 0, 1)(1, 1, 1)7 is much better.

However, according to experience, the model specification guideline does not always give the best model. Usually, we try several more models to make sure the best one is chosen; in doing so, users should be wary of overfitting the model. Therefore, we also try (0, 0, 1)(1, 1, 1)7 , (1, 0, 0)(1, 1, 1)7 and (0, 0, 0)(1, 1, 1)7 , as sketched below.
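As a sketch of this model-comparison step, the following fits the candidate specifications with statsmodels; the variable series is assumed to hold the daily counts of one case (e.g. Data 1), and AIC is used here only as one convenient comparison criterion.

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# `series` is assumed to hold the daily counts of one case (e.g. Data 1)
candidates = [(1, 0, 1), (0, 0, 1), (1, 0, 0), (0, 0, 0)]
for order in candidates:
    fit = SARIMAX(series, order=order,
                  seasonal_order=(1, 1, 1, 7)).fit(disp=False)
    print(order, "AIC:", fit.aic)  # compare candidates, wary of overfitting
```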

To build a DLM, we just need to follow the method in Section 2.3. Figure 2.23 shows the comparison of the results of the proposed method and the four ARIMA models for the six cases. The solid blue line represents the proposed method and the colored dotted
Figure 2.21: ACF and PACF plots of model (0, 0, 0)(1, 1, 1)7 for case 1

Figure 2.22: ACF and PACF plots of model (1, 0, 1)(1, 1, 1)7 for case 1

lines represent ARIMA models with different specifications. From Figure 2.23, it is clear that DLM performs better in cases 5 and 6 than in cases 1 to 4. The reason is probably that cases 5 and 6 have mean drift, which can be modeled by DLM but not by ARIMA.

Figure 2.23: MSE of the two methods, DLM (dark blue) vs. ARIMA, for the 6 cases, ordered left to right and top to bottom

Figure 2.24 shows the R estimates of the six cases. In cases 1 to 4, the R estimate approaches zero, which means the signal-to-noise ratio is very low. Cases 5 and 6 show much larger R estimates. Recalling Figures 2.12 and 2.13, higher R values give slightly larger errors, but if R is much smaller than 0.1, the estimate will be biased and have large variance. So, cases 5 and 6 may give slightly larger errors simply due to having large R, and cases 1 to 4 may estimate their R's inaccurately. We should keep these properties in mind when we analyze Figure 2.24 and draw any conclusions.

Figure 2.24: R estimates of the 6 cases, ordered left to right and top to bottom. We see that, except for case 5, the R estimate approaches a stable value when there are more observations.

From Figure 2.25, the updating-R procedure is always better than the fixed-R method. So the proposed updating-R procedure is not only convenient, since it determines R systematically and automatically, but also better than, or at least as accurate as, the fixed-R method.

2.6 Summary And Future Work

Figure 2.25: Updating-R vs. Fixed-R (R = [0.001 0.01 0.02 0.05 0.08])

Time series forecasting in service industries is a very challenging problem due to the random patterns in the data. Many methods have been developed to solve this problem and successfully applied in industry. However, these methods have several issues. For instance, the hidden mean of a time series in service industries may vary over time, but ARIMA is not able to model this phenomenon. Moreover, ARIMA needs a large amount of historical data to model a time series, which is not always available. Also, model specification sometimes imposes much difficulty on practitioners: if a wrong model is specified, the forecasting results will be misleading. Therefore, the DLM, a special type of state space method, has been developed as an alternative tool for time series forecasting. However, to apply the DLM, the signal-to-noise ratio R has to be specified. Since the true value of R is generally not available, the only way is to guess a value, which is inconvenient and unreliable. To overcome this problem, we propose a method to estimate R automatically within the forecasting procedure. The properties of the proposed R estimator and the new forecasting procedure with this estimator are studied by simulations. A case study is also done in which the proposed method is compared with the ARIMA method using six datasets from service industries. It is found that the proposed method outperforms ARIMA when a time series has mean drift.

There are some open issues in this study which will be considered in my future

research. The following are two examples:

1. The variation of the signal-to-noise ratio: Besides the issue of the unknown signal-to-noise ratio, the DLM assumes the signal-to-noise ratio to be constant. In practice, it is possible that the signal-to-noise ratio is not constant but changes over time. It will be interesting to study the behavior of a changing R and to develop a forecasting procedure with appropriate R estimates.

2. Variation of cycle or seasonality: West et al. [22] suggest two ways to cope with cyclical features and seasonality: the form-free seasonality model and the Fourier form representation of seasonality. In the form-free seasonality model, they suggest first-order and second-order polynomials to model the seasonality. In the Fourier form representation of seasonality, they suggest breaking the time series down into many harmonic cycles to represent the seasonality. Application of these methods will be considered in my future research.

CHAPTER 3

Pattern-Based Real-Time Prediction Of Semi-Periodic And Nonstationary Time

Series

3.1 Introduction

Respiration is an involuntary action, yet, within limits, individuals are capable of controlling their own breathing [5]. Respiratory motion is mainly regulated by the level of the partial pressure of carbon dioxide: a higher partial pressure of CO2 means a stronger urge to breathe [5]. Besides physical factors, environmental and psychological factors may also contribute to the variation. Respiratory motion time series are therefore semi-periodic and time-varying.

The occurrences of drifts can be considered random. One thing to note is that both inter-individual and intra-individual variations are usually significant. Fortunately, respiratory patterns usually show some statistical properties that can be used for the prediction of tumor position. Much effort has been devoted to mining these properties and using them for respiratory motion prediction. In consideration of the high inter- and intra-individual variation, an adaptive method that works well for all kinds of patients at all times is desired. The importance of respiratory motion prediction is introduced in the following. In robotic radiosurgery and medical imaging such as positron emission tomography (PET) or computed tomography (CT) scans, all devices that involve tracking tumor position during treatment suffer from system latencies [5, 24]. System latency is the response time of the whole system from detection to delivery of the treatment action.

If a tumor is in the thoracic or upper abdominal region, respiratory motion will be the dominant factor in tumor movement [25]. Without accounting for respiratory motion, critical misalignment between the irradiated field and the target tumor volume may occur during a radiotherapy treatment fraction, imposing significant radiation dose on normal body tissues. There are five ways to account for respiratory motion in radiotherapy [5]: motion-encompassing methods, respiratory gating methods, breath-hold methods, forced shallow breathing with abdominal compression, and real-time tumor-tracking methods.

The first four methods usually require patients' participation, which is inconvenient and may not be feasible for all patients. Respiratory gating methods were first adopted in Japan in the late 1980s [5]. Following their success, the approach has become popular and much research effort has been devoted to this technique.

Another method developed to account for respiratory motion during radiotherapy is real-time tumor tracking. One of the most famous systems, the CyberKnife robotic linear accelerator, is a realization of real-time tumor tracking. Real-time tumor tracking can utilize the total duty cycle without any interruption. This method requires the least human participation, which may enhance reliability. Also, time is saved, which means more patients can be served within a limited time span. To succeed, this method should be able to do four things: (1) identify the tumor position in real time; (2) anticipate the tumor motion to allow for time delays in the response of the beam-positioning system; (3) reposition the beam; and (4) adapt the dosimetry to allow for changing lung volume and critical structure locations during the breathing cycle. For the gating method, special caution must be taken by the therapist if the breathing pattern differs from the simulation. This problem does not exist in the real-time tracking method as long as the system can do the aforementioned four things. In this chapter, we discuss in detail the second
Table 3.1: A list of latencies (in ms) of different systems.

                                  VERO   MLC   MAD   CyberKnife
Position acquisition                25   309    30           25
Position calculation                 2    20     -           15
Gimbals/MLC/robot control cycle     20    52    45           75
Other                                -    38   100            -
Total                               47   420   175          115

task: predicting the tumor motion to compensate for the system latency, which ranges from 115 ms to 420 ms for different systems [26, 24].

The current generation of the CyberKnife has a latency of about 115 ms, down from 192.5 ms in the previous version, which is still widely in use [26].

3.2 Related Work

Many prediction methods for respiratory motion have been developed for radiation therapy to compensate for system latencies. The following list contains the methods that have been in the spotlight in recent years.

Through this study, we have reviewed some of the latest methods. These include the Time-Varying Seasonal Autoregressive model with Residual Adaptation (TVSAR) [27, 17], Neural Networks (NN) [28], Kernel Density Estimation (KDE) [29], Support Vector Regression prediction (SVRpred) [16, 24], Recursive Least Squares (RLS) [24], the MULIN algorithms [24], Normalized Least Mean Squares, Wavelet-based Multiscale Autoregression [24, 30], Wavelet Neural Networks [31] and EKF Frequency Tracking.

Ernst [24] conducted a survey of some of these methods in 2013. He concluded that Wavelet-based Multiscale Autoregression (wLMS) [15, 30] has the best performance in short-term prediction, and that Support Vector Regression prediction (SVRpred) [16], which was developed based on the Accurate Online Support Vector Regression proposed by Renaud [30], performs better in longer-term prediction. Support vector regression (SVR) has been widely applied to respiratory time series prediction [32, 16, 24, 33, 34, 35, 36]. In all of these current SVR methods, the coefficients are trained either using the whole time series, to capture all possible information, or using a sliding window, to capture recent developments. We will show that, through pattern matching, better prediction can be obtained by selecting only similar patterns as inputs to SVR. Ichiji [37, 27, 17] proposed resi-TVSAR in 2013 and reported very good performance. Therefore, we select TVSAR, wLMS and SVRpred for comparison with our proposed method. We will also compare our proposed method to Seasonal ARIMA, a very popular classic method.

3.2.1 Time-Varying Seasonal Autoregression (TVSAR)

The time-varying periodic nature of respiratory motion makes the prediction challenging. Most methods, which assume constant periodicity, do not apply to this problem. ARIMA, a very popular method, provides a general framework which can model linear and stationary time series or homogeneous non-stationary time series. Seasonal ARIMA (SARIMA) was developed to further cope with constant periodicity. To overcome the time-varying periodic nature, Homma proposed a modified SARIMA [37, 14] in 2009. The method converts the time-varying periodic component to a constant periodic one by adjusting the time variation. However, because the time-varying periodic component is random in nature, it is very hard to convert it accurately to a constant form, and the outcome of the modified SARIMA is not satisfactory. In 2012, Ichiji et al. [37] proposed TVSAR, an AR model that takes the varying cycles into account. The details of the method follow.

The N th SAR model of a time series y(t), t = 1, 2, . . . is given as follows [37, 27, 17]:

y(t) = ε(t) + Σ_{n=1}^{N} Φn · y(t − n · s) (3.1)

where Φn , n = 1, 2, . . . , N are the SAR coefficients, N is the order of the SAR model, s is the period of the target time series y(t), and ε(t) ∼ N (0, σ²) is Gaussian noise.

Then, the SAR model-based equation for h-sample-ahead prediction is given by replacing t with t + h:

ŷ(t + h|t) = Σ_{n=1}^{N} Φ̂n · y(t + h − n · s) (3.2)

We can note that this assumes a constant prediction horizon h for all time-varying

intervals.

To overcome the limitation of the general SAR model, TVSAR introduced a

time-varying and irregular interval, instead of a constant period s.

The TVSAR model can then be written as:

y(t) = ε(t) + Σ_{n=1}^{N} Φn · y(t − r̂n (t|t)) (3.3)

The prediction equation of the N th TVSAR model for prediction horizon h is given as:

ŷ(t + h|t) = Σ_{n=1}^{N} Φ̂n · y(t + h − r̂n (t + h|t)) (3.4)

where r̂n (t|t) > 0 are called reference intervals, indicating the past observed values at a phase corresponding to the current value y(t). The reference intervals are the key part of TVSAR; they are found by pattern matching, i.e., by calculating the correlation with the past data. In other words, the reference intervals point to the past samples that are at the same phase as the current value y(t). An SAR model is then built using the points which are in the past few cycles and at the same phase as the to-be-predicted value.


The reference intervals are found by finding the lags which maximize the correlation between the past data and the current window. The estimation procedure is as follows:

At time t, calculate a correlation function of lag k = 0, 1, 2, . . . , given by

C(t, k) = (1/w) Σ_{j=0}^{w−1} [(y(t − j) − µt )/σt ] · [(y(t − k − j) − µt−k )/σt−k ] (3.5)

where µt and σt are the sample mean and standard deviation of the subset time series of length w given by [y(t − w + 1), y(t − w + 2), . . . , y(t)]. Figure 3.1 illustrates the estimation procedure [17, 37]. The nth reference interval is estimated by finding the lag k

Figure 3.1: Estimation procedure for reference intervals

which attains the nth local maximum of the correlation function C(t, k):

r̂n (t|t) = arg max_k C(t, k) (3.6)

where the search range is set as half of w around the reference interval found at time t − 1, i.e. r̂n (t − 1|t − 1) − w/2 < k < r̂n (t − 1|t − 1) + w/2 for each n.

The subset length is updated by w = ⌊a · r̂1 (t|t) + 0.5⌋, where a is a coefficient to adapt the length based on r̂1 (t|t). The initial reference intervals used for the estimation procedure are given as:

r̂n (ts |ts ) = n × ŝ (3.7)
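A minimal Python sketch of this reference-interval search under the definitions above; the variable prev_r (the intervals found at time t − 1) is an illustrative name, edge handling is simplified, and y is assumed to be a NumPy array.

```python
import numpy as np

def reference_intervals(y, t, prev_r, w, N):
    """Sketch of the search in eqs. (3.5)-(3.6): for each n, find the lag
    k maximizing the normalized correlation between the current window of
    length w and the window lagged by k, searching within +/- w/2 of the
    interval found at time t-1 (prev_r[n])."""
    seg = y[t - w + 1:t + 1].astype(float)
    seg = (seg - seg.mean()) / seg.std()
    r_hat = []
    for n in range(N):
        lo = max(w, prev_r[n] - w // 2)     # keep lags past the window
        hi = prev_r[n] + w // 2
        best_k, best_c = lo, -np.inf
        for k in range(lo, hi + 1):
            lag = y[t - k - w + 1:t - k + 1].astype(float)
            lag = (lag - lag.mean()) / lag.std()
            c = np.mean(seg * lag)          # C(t, k)
            if c > best_c:
                best_k, best_c = k, c
        r_hat.append(best_k)
    return r_hat
```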

The major issues of TVSAR are:

1. It does not take baseline shift and amplitude change into account.

2. It is hard to maintain an effective window size, which makes the search for reference intervals difficult: a small window is susceptible to noise, while a larger window may overlook a potentially similar phase.

3. It assumes a fixed prediction horizon h for all reference intervals, which obviously have various lengths.

3.2.2 Wavelet-Based Multiscale Autoregression

Since respiratory motion arises from the coordination of multiple muscles and organs, a respiratory motion time series is a record of chest motion at a particular location under the coordination of multiple body parts over a definite time horizon. Simply speaking, a respiratory motion time series is a mixture of various signals. Figure 3.2 shows an example of a 3-level wavelet decomposition of respiratory motion. Each band has its own pattern, and wavelet decomposition can thus enhance the predictive power of an autoregressive model. Renaud et al. [30] proposed using the à trous wavelet decomposition to build an autoregressive model

54
Figure 3.2: Wavelet decomposition of a respiratory motion time series

for prediction. They provided the following closed-form equations for the wavelet coefficients at subsequent times, so online updating of the wavelets is available:

c0,n = yn , cj+1,n = (cj,n−2^j + cj,n )/2, Wj+1,n = cj,n − cj+1,n

A signal y is decomposed into J discrete wavelet scales, W1 , . . . , WJ , and a smoothed signal cJ by passing it through low-pass and high-pass filters with particular frequency ranges, i.e. yn = W1,n + · · · + WJ,n + cJ,n . Also, aj and aJ+1 denote the regression depths of level Wj and of the smoothed signal cJ respectively. The multiscale autoregressive (MAR) forecast can be made by building an autoregressive (AR) prediction

model for each wavelet scale and then summing up all of the predictions:

ŷ_{n+k}^{MAR} = Σ_{j=1}^{J} wj^T Ŵn,j + w_{J+1}^T ĉn (3.8)

Ŵn,j = (Wj,n−2^j·0 , Wj,n−2^j·1 , . . . , Wj,n−2^j·(aj−1) )^T (3.9)

ĉn = (cJ,n−2^J·0 , cJ,n−2^J·1 , . . . , cJ,n−2^J·(aJ+1−1) )^T (3.10)

An example with aj = 3 for all j and J = 2 is illustrated in Figure 3.3.

Figure 3.3: An example of an order-3 AR model built from 2-level wavelet scales

The weights of each AR model, wj , are learned adaptively from the least-mean-squares prediction error over a window of data.

B = (ln−k , . . . , ln−k−M+1 )^T , lt = (Ŵt,1^T , . . . , Ŵt,J^T , ĉt^T )^T (3.11)

w = (w1^T , . . . , w_{J+1}^T )^T , sn = (yn , . . . , yn−M+1 )^T

Solve Bw = sn via the normal equations to update w. B denotes the wavelet decomposition at time n − k with window size M .

To cope with the possible singularity of the normal equations used to solve for w, Ernst [15] replaces (B^T B)^{−1} B^T by the Moore–Penrose pseudoinverse of B. Ernst also suggests an exponential averaging parameter µ to include possible missing information which is not contained in the current signal window. Finally, wLMS is defined as follows:

ŷ_{n+k}^{MAR} = Σ_{j=1}^{J} w_{n,j}^T Ŵn,j + w_{n,J+1}^T ĉn (3.12)

ŵn = (w_{n,1}^T , . . . , w_{n,J+1}^T )^T (3.13)

w1 = · · · = w_{k+M−1} = (0, . . . , 0)^T (3.14)

w_{n+1} = (1 − µ)wn + µB^+ sn , µ ∈ [0, 1], n ≥ k + M (3.15)

Note that wLMS only uses the latest data to build a model. It performs very well for very short-term prediction, but its medium- to long-term prediction ability is unsatisfactory.

3.3 Pattern-Based Variant Best-Neighbors Prediction Using Raw Data

Respiratory motion is a type of semi-periodic motion which shows periodicity with variation in mean position, phase and frequency. The occurrence of these changes has complex causes and can be considered random, which means that, for a given pattern, the future value has an expected value with some variance. Even though the future value of an individual pattern is random, a collection of similar patterns can give a very accurate prediction. Our study of tumor position prediction shows that the average of these responses provides a very accurate and effective prediction of the respiratory motion.

In this study, a pattern-matching-based framework is employed to discover similar patterns in the past record and exploit the information of those patterns for prediction of tumor position. Figure 3.4 shows the general approach of the pattern-based online prediction framework. Instead of only using recent cycles or using the whole time series to train a model, as most approaches do, an effective and accurate way is to look for similar patterns in the past record and analyze the information of those patterns to make the prediction.

Figure 3.6 shows scatter plots and partial autocorrelation plots of the cycle height of a respiratory motion time series versus its first lag. It shows that the height of respiratory motion is autoregressive, which provides a strong basis for using pattern matching for prediction, because similar patterns should have stochastically similar responses.

Figure 3.4: The general approach of the proposed pattern-based Variant-Best-Neighbors prediction using raw data

We introduce a pattern-based Variant Best Neighbors (VBN) prediction method.

The number of best neighbors, which varies, is determined by a pattern-similarity threshold and a cutoff value (k). The general approach is shown in Figure 3.4. Before starting prediction, the ratio of window size to cycle length (R) needs to be

Figure 3.5: Three best neighbors (solid black lines) of the current segment (solid blue
line), the dotted lines are their ”future” values

determined by training and validation. In fact, the first step in the flowchart is itself a prediction process, used to try various parameter sets.

Through validation, the parameter set that gives the most accurate result is selected. After obtaining the optimal window ratio, a pattern library is built using the selected R. Then, we determine the best matching patterns from the pattern library by a variant best neighbors (VBN) approach. Next, the optimal subset of predictive patterns is decided by statistical and feature analysis: the previous step provides the generally best-matching patterns, and this step further refines the set of best neighbors (BNs) in order to significantly enhance the prediction performance. After obtaining the BNs, we can use their information to make a prediction, either by simply taking the average of the "future" values of the BNs or by applying support vector regression. The performance of both methods will be discussed in the results section. Figure 3.5 illustrates the general approach: three patterns similar to the current segment at time t are found. The solid lines represent
the current segment and their BNs. The dotted lines represent the future values of the best neighbors and the current segment.

Figure 3.6: Scatter plots (left) and sample partial autocorrelation functions (right) of the height and the interval of respiratory motion versus their 1st lags

3.3.1 Personalized Pattern Monitoring Window Size

In our study, the respiratory motion data of 27 patients are used. The length of the data ranges from about 30 minutes to 60 minutes. The first 60% of the total data is used to build up a pattern library; the next 20%, but at most 6,000 points, is used for validation, and the remainder for testing. The median of the intervals, measured as the distance between consecutive peaks as shown in

Figure 3.7a, is used as a baseline for the window sizes of the two-window design. The median is used because of the skewness of the distribution of the intervals, as shown in Figure 3.7b.

(a) Definition of Interval (b) Histogram of Interval

Figure 3.7: An illustration of the definition of Interval and a histogram of the interval.

To reasonably determine the final window sizes, we find the optimal ratio R of window size to the median of the peak-to-peak intervals. R is determined by validation. The window sizes are then multiplied by the selected ratio, L′j = Lj × R for j = 1, 2, where L1 and L2 are the window sizes of the smaller and larger windows respectively. Pattern libraries Bn×Lj are built for each window size, where n is the number of time series segments in the library and Lj is the size of the j th window. In our study, the base ratios for the smaller and larger windows are set to 0.5 and 1.5 respectively.

In the validation process, one window ratio is picked each time. R-square is used as the performance measurement because it provides a universal metric that describes how close the prediction is to the real data.

R̂ = arg max_R [ 1 − Σ_{t=1}^{n} (ŷ(t, R) − y(t))² / Σ_{t=1}^{n} (y(t) − ȳ)² ] (3.16)
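A minimal sketch of this validation step in Python; predict_fn is a hypothetical callable that runs the pattern-based predictor on the validation segment for a given ratio R, and the default candidate ratios mirror those used later in the experiments.

```python
import numpy as np

def select_window_ratio(y_true, predict_fn, ratios=(0.75, 1.0, 1.25)):
    """Pick the window ratio R maximizing the validation R-square
    (eq. 3.16). `predict_fn(R)` is assumed to return the predictions
    on the validation segment for a given ratio."""
    ybar = np.mean(y_true)
    best_R, best_r2 = None, -np.inf
    for R in ratios:
        yhat = predict_fn(R)
        r2 = 1 - np.sum((yhat - y_true) ** 2) / np.sum((y_true - ybar) ** 2)
        if r2 > best_r2:
            best_R, best_r2 = R, r2
    return best_R
```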


Figure 3.8: The prediction accuracy for various window lengths, with the window ratio to median interval (R) ranging from 0.3 to 1.5, for prediction horizons h = 1, 5, 10, 15, 20, 25, 30 for patient 16

From Figure 3.8, we find that, for patient 16, if only one window is used for prediction, a shorter window size is better than a longer one. For longer prediction horizons, we observe two local maxima, at R = 0.5 and R = 1.2.

The 3-D plot of the prediction accuracy for various R's shows that the prediction accuracy mostly depends on the prediction horizon, but we can observe that the effect

Figure 3.9: A 3-D plot of the prediction accuracy for various window lengths, with the window ratio to median interval (R) ranging from 0.3 to 1.5, for prediction horizons h = 1, 5, 10, 15, 20, 25, 30 for patient 16

of the window ratio to median interval (R) becomes more significant as the prediction horizon (h) increases.

The window ratio R that maximizes the R-square is selected for prediction. Table 3.3 presents the results of using the adaptive ratio. It shows that, with the adaptive ratio, the prediction is a little more accurate.

However, for now, only 3 ratios - 0.75, 1 and 1.25 - are considered in the experiments, and one ratio is used for all windows each time. In the future, we will study how

to optimize the window ratio for each window in order to obtain the optimal result from the current method.

3.3.2 Variant Best-Neighbors-Based Predictive Pattern Selection

Phase I: Searching for Initial Best Neighbors By Using Right-Aligned Patterns Figure 3.10 illustrates phase I of the VBN approach. In phase I, the

Figure 3.10: A flow chart of the VBN procedure: Phase I

best matched patterns are discovered by searching from the pattern libraries of two

window sizes, L1 and L2 .

We use the current segment to look for patterns whose similarity measures S, defined in equation (3.18), satisfy a similarity threshold θ. The baselines of the candidates must be removed because we are only interested in their patterns. Regular VBN patterns have the problem of not considering signal shifting, which leads to inaccurate prediction due to shift errors, as shown in the left graphs of Figures 3.11 and 3.12. To achieve accurate prediction, we propose to align the patterns at their rightmost point during the VBN searching process. As shown in Figures 3.11 and 3.12, the right alignment puts higher weight on the right end and helps to obtain best neighbors that have a better match at the right end, which we have found to be more important for prediction.

ũn = un − un (end) (3.17)

where un is the segment of the nth candidate in the pattern library and u0 is the current segment. In this study, we introduce a usage of R-square as a similarity metric for two line segments:

Sn = 1 − Σ_{i=1}^{Lw} (ũn (i) − ũ0 (i))² / Σ_{i=1}^{Lw} (ũ0 (i) − ū0 )² (3.18)

where ū0 is the mean of the aligned current segment ũ0 .

Figures 3.11 and 3.12 show close views of typical examples of best neighbors found by comparing raw patterns and right-aligned patterns with the current segment. First of all, from Figures 3.11 and 3.12 we see that using different alignments gives us different best neighbors. Even though the best neighbors found from raw patterns may show high overall similarity, the right-aligned best neighbors show better matching at the right side, which is the point closest to the value to be predicted. From Figures 3.11 and 3.12, we see that right-aligned BNs give better prediction results.

Then, we sort the list of Sn in descending order to begin the acquisition of BNs. We iteratively take BNs from the top of the list until at least k BNs are obtained and the next Sn is smaller than the threshold θ. One important step is

Figure 3.11: Online prediction of a patient's respiratory data using unaligned BNs (left) and right-aligned BNs (right). Below are the best neighbors marked with vertical lines in the time series.

Figure 3.12: A zoom-in view of Figure 3.11, using unaligned BNs (left) and right-aligned BNs (right). We can see that the right-aligned BNs are obviously better than the unaligned BNs.

to remove the candidates which are adjacent to the selected BNs, in order to prevent bias. Our removal strategy is as follows:

Bk = {u ∈ Bk−1 | tu < tλk−1 − m or tu > tλk−1 + m} (3.19)

where Bk denotes the library after accepting the kth best neighbor; tλk denotes the time at the end of the k th best neighbor λk ; and m denotes a small distance within which candidates are excluded. In the respiratory motion prediction study, we choose this distance as one-fifth of the median peak interval, i.e. m = 0.2 × median. A list of BNs, Bλ , is then obtained at the end of Phase I. A minimal sketch of this phase follows.
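The sketch below assumes the pattern library is a list of equal-length NumPy segments and that adjacency can be measured by library index; both are simplifying assumptions for illustration, and the default parameter values are placeholders.

```python
import numpy as np

def phase_one_best_neighbors(library, segment, theta=0.95, k=5, m=10):
    """Phase I sketch: right-align candidates and the current segment
    (eq. 3.17), score them with the R-square similarity (eq. 3.18),
    then accept neighbors from the top of the ranking while excluding
    candidates within m samples of an already accepted neighbor."""
    u0 = segment - segment[-1]                      # right alignment
    denom = np.sum((u0 - u0.mean()) ** 2)
    scores = []
    for idx, cand in enumerate(library):
        un = cand - cand[-1]
        s = 1.0 - np.sum((un - u0) ** 2) / denom    # eq. (3.18)
        scores.append((s, idx))
    scores.sort(reverse=True)                       # best match first
    chosen = []
    for s, idx in scores:
        if len(chosen) >= k and s < theta:
            break
        if all(abs(idx - j) > m for j in chosen):   # adjacency exclusion
            chosen.append(idx)
    return chosen
```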

Phase II: Further Refining The Obtained Best Neighbors By Considering A Larger Window In phase I, by using a short window, the best neighbors are the patterns best matched over a short-term range. That is an important step to attain high accuracy in short-term prediction: the better the match on the most recent data, the more closely the future trend of the best neighbors will be followed. A short window does a better job at this.

The next step is to consider the pattern matching over a longer horizon for the best neighbors obtained from the previous step, in order to guarantee a correct match in the phase of the respiratory motion. In other words, we first look closely to get a clear picture of the short-term trend at the current moment, and then we look at a bigger picture to figure out at what phase of a cycle the respiratory motion is.

The sizes of the shorter and longer windows are two parameters that need to be decided. The optimal sizes can be determined by considering various values in the validation process. In validation, the shorter and longer windows are initially set to 0.5 and 1.5 of the median cycle length; the window sizes are then multiplied by several ratios R, and the best ratio is selected for each patient individually.
In phase II, to search for the best neighbors among the best neighbors obtained in phase I, we repeat the process of phase I, except that this time the window size is changed to the longer window.

After finalizing the set of BNs, prediction is made using the "future" information of the BNs. The simplest effective method is to take the average of their "future" values, as in equation (3.25). For some cases, when the best neighbors do not match very well, we may instead consider support vector regression (SVR); we name this method Right-Aligned Pattern-Based VBN-SVR Prediction (RPKS). The next section discusses this.

Phase III: Best Neighbors Removal Using Statistical Analysis Figure 3.13 shows scatter plots of the errors of short segments before and after tλk for the first eight best neighbors of patient 23. Figure 3.15 is a close view of the best neighbors around tλk ; the blue solid line is the current segment. This example shows that a higher error on the left side implies a higher error on the right side; the correlation is obvious in this example. Figure 3.14 is a drawing that clearly illustrates this phenomenon.

The errors of the mismatch of a short segment of length l just before and just after time t are significantly correlated, where t is the current time point of the current segment and the corresponding time points of the candidates. To further refine the set of candidates, we suggest removing those candidates with a larger mismatch error over a few points just before time t. The sum of squared errors of the match of a short segment of λk with the data D at time t is:

dλk = Σ_{i=−l+1}^{0} (Dtc+i − Dtλk+i )² (3.20)

Figure 3.13: Scatter plots of the error before tλ vs the error after tλ . Correlation
between the errors is observed.

Figure 3.14: An illustration of the error of the best neighbors before and after time tλk . If the error on the left-hand side is large, then the error on the right-hand side is also likely to be large.

Next, since the distributions of the errors are skewed and the skewness varies among individuals, we remove the candidates with error larger than the median plus one and a half times the median absolute deviation (MAD):

B̃ = {λ ∈ B | dλ ≤ dmedian + 1.5 × dMAD } (3.21)
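A minimal sketch of this screening rule; the neighbors and their precomputed mismatch errors dλ are assumed as inputs, and the function name is illustrative.

```python
import numpy as np

def mad_filter(neighbors, errors):
    """Phase III screening (eq. 3.21): keep only neighbors whose recent
    mismatch error d_lambda is at most median + 1.5 * MAD."""
    errors = np.asarray(errors, dtype=float)
    med = np.median(errors)
    mad = np.median(np.abs(errors - med))
    return [nb for nb, e in zip(neighbors, errors) if e <= med + 1.5 * mad]
```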


Figure 3.15: A real example of the error of the best neighbors before and after time
tλk

Although it is rare, sometimes the predicted value of a best neighbor is an outlier among the other candidates, as shown in Figure 3.16. So, the last step of Phase III is to remove those candidates as follows:

B̌ = {λ ∈ B | D(tλ + h) < min( max(D(tλ + h)), Pλ75 + 1.5 × (Pλ75 − Pλ25 ) )} (3.22)

Figure 3.17 shows two examples of best neighbors without any outliers.

3.3.3 Online Prediction Frameworks Using the Selected Predictive Patterns

Prediction Using Average of the Future Values of Reference Patterns For the best neighbors, the expected values h samples ahead are assumed to be similar:

E[y(t + h)] ≈ E[y(tk + h)] for k = 1, . . . , K (3.23)

Figure 3.16: An example of an outlier in the best neighbors

Figure 3.17: Another example of best neighbors without any outliers

where K denotes the number of similar patterns and y(tk + h) denotes the value h samples ahead of the kth referenced pattern. Therefore, the prediction h samples ahead made using the k best neighbors can be written as:

y(t + h) = ε(t + h) + Σ_{k=1}^{K} Θk · y(tk + h) (3.24)

where ε(t + h) denotes the error of predicting the value at time t + h and Θk denotes the coefficient of the referenced value of the k th BN. ε(t + h) includes the random error and the pattern mismatch error.

Taking the average of the referenced values, i.e. the samples h samples ahead of all BNs, for prediction, the proposed model equation can be written as:

ŷ(t + h) = Σ_{k=1}^{K} Θ̂k · y(tk + h) (3.25)

We set Θk = 1/K to use the mean of the future values of the referenced patterns for prediction.

Prediction Using Bootstrapping Average Bootstrap aggregating, also called bagging, is an appropriate way to control and check the stability of the results, and is asymptotically more accurate than the standard intervals obtained using the sample variance and assumptions of normality. By careful choice of the size of the resamples, bagging can lead to substantial improvements in the performance of the kNN method. Adèr et al. recommend the bootstrap procedure for situations where the theoretical distribution of a statistic of interest is complicated or unknown and where the sample size is insufficient for straightforward statistical inference.

In the proposed method, sometimes only a small number of nearest neighbors is obtained when using R-square = 0.95 as the similarity threshold. In this case, the sample size is too small for straightforward statistical inference, and bootstrapping may help to control and check the stability of the results by looking at the bootstrap confidence interval.

The Right-Aligned Pattern-Based Variant-Best-Neighbors Prediction by Bootstrapping Average (RPKM) is defined as:

ŷ(t + h) = (1/M ) Σ_{m=1}^{M} ( (1/N ) Σ_{n=1}^{N} y(tmn + h) ) (3.26)

where M is the number of bootstrap resamples and N is the resample size.

If the referenced values of the nearest neighbors are normally distributed, we may just directly use the simple average and the standard interval: in this case, the bootstrapped average and confidence interval are asymptotically consistent with the simple average and standard interval, so there is no benefit to bootstrapping.

Check for Normality Therefore, before using bootstrapping, we use the Kolmogorov-Smirnov test to check the normality of the referenced values of the nearest neighbors. The null hypothesis states that the population is normally distributed. Figure 3.18 shows the Kolmogorov-Smirnov test results for patient 2 and prediction horizon h = 15: zero means failing to reject the null hypothesis, while one means rejecting it.
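A minimal sketch of this check with scipy; note that standardizing by the sample mean and standard deviation before calling kstest makes the test approximate (the Lilliefors caveat), and the 0.05 level is an illustrative choice.

```python
import numpy as np
from scipy import stats

# `future_values` is assumed to hold the neighbors' values at horizon h
x = np.asarray(future_values, dtype=float)
z = (x - x.mean()) / x.std(ddof=1)   # standardize before testing
stat, p = stats.kstest(z, 'norm')    # H0: the values are normal
use_bootstrap = p < 0.05             # reject normality -> bootstrap
```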


Figure 3.18: Kolmogorov-Smirnov test during prediction of the respiratory motion of


patient 2 with prediction horizon h = 15

Prediction Using Support Vector Regression Best neighbors can only be very similar to, but rarely exactly identical to, the current segment. Support vector regression (SVR) provides a way to bridge this remaining gap between the current segment and the best neighbors. In general, SVR can enhance the prediction slightly compared to simply using the mean value as the predicted value.

SVR obtains a regression function by solving an optimization problem. The advantages of support vector regression are the nonlinearity of the regression function, the ability to handle high-dimensional inputs, and its robustness to outliers. Due to these strengths, SVR provides satisfactory respiratory motion prediction.

Figure 3.19 illustrates a simple example of SVR. The middle line is the regression line, and the upper and lower lines pass through the support vectors. The insensitivity parameter ε adjusts the coarseness of the regression: a smaller ε gives a finer regression line. The slack variables ξ allow outliers to be excluded, and a regularization parameter C controls the cost of introducing slacks.

By choosing a kernel function Φ and using the obtained best neighbors for training, the weights w of the following SVR function can be obtained by optimization algorithms:

y(t + h) = w^T Φ(ut ) + b (3.27)

Prediction is then done by inputting the current segment. The training can be formulated as the following optimization problem [16, 35]:


min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^{L} (ξi + ξi∗ )

s.t. yi+δ − w^T Φ(ui ) − b ≤ ε + ξi (3.28)

w^T Φ(ui ) + b − yi+δ ≤ ε + ξi∗

ξi , ξi∗ ≥ 0, i = 1, . . . , L.
Note that the problem satisfies the KKT conditions. We can introduce Lagrange multipliers α, α∗ , η, η ∗ ≥ 0 and rewrite the problem as follows:

L = (1/2)‖w‖² + C Σ_{i=1}^{L} (ξi + ξi∗ ) − Σ_{i=1}^{L} (ηi ξi + ηi∗ ξi∗ ) − Σ_{i=1}^{L} αi (ε + ξi − yi+δ + w^T Φ(ui ) + b) − Σ_{i=1}^{L} αi∗ (ε + ξi∗ + yi+δ − w^T Φ(ui ) − b) (3.29)

From the saddle point condition, the partial derivatives of L with respect to the primal variables (w, b, ξi , ξi∗ ) have to vanish at optimality:

∂b L = Σ_{i=1}^{l} (αi∗ − αi ) = 0

∂w L = w − Σ_{i=1}^{l} (αi − αi∗ )Φ(ui ) = 0 (3.30)

∂ξi(∗) L = C − αi(∗) − ηi(∗) = 0

Substituting equations (3.30) into equation (3.29) yields the dual optimization problem:

maximize −(1/2) Σ_{i,j=1}^{l} (αi − αi∗ )(αj − αj∗ )⟨xi , xj ⟩ − ε Σ_{i=1}^{l} (αi + αi∗ ) + Σ_{i=1}^{l} yi (αi − αi∗ ) (3.31)

subject to Σ_{i=1}^{l} (αi − αi∗ ) = 0 and αi , αi∗ ∈ [0, C]

By solving equation (3.31), we obtain the regression function (equation 3.27), and we input the current segment and the best neighbors into that function to make the prediction. The proposed Right-Aligned Pattern-VBN-Based SVR prediction is denoted RPKS in the rest of this chapter; a minimal sketch follows.
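This sketch trains an ε-SVR on the best neighbors with scikit-learn; the hyperparameter values shown are placeholders for the grid search described in the experimental settings, and feeding whole segments as feature vectors is a simplifying assumption.

```python
import numpy as np
from sklearn.svm import SVR

def rpks_predict(neighbor_segments, neighbor_futures, current_segment,
                 C=1.0, epsilon=0.01, gamma='scale'):
    """RPKS-style step: fit an epsilon-SVR mapping each best-neighbor
    segment to its value h samples ahead, then evaluate it on the
    current segment."""
    X = np.asarray(neighbor_segments)   # one row per best neighbor
    y = np.asarray(neighbor_futures)    # their values at t_k + h
    model = SVR(kernel='rbf', C=C, epsilon=epsilon, gamma=gamma)
    model.fit(X, y)
    return model.predict(current_segment.reshape(1, -1))[0]
```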

Figure 3.19: Illustration of support vector regression with insensitivity parameter ε and slack variable ξ.

3.3.4 Comparison of the Prediction Performance of RPKM and Some State-Of-The-Art Methods

The following compares the prediction performance of RPKM with the latest state-of-the-art methods. Wavelet-based Multiscale Autoregression (wLMS) and Support Vector Regression prediction (SVRpred) were identified as the best methods by the survey conducted by Ernst [15, 30, 16, 24] in 2013. TVSAR is a method developed by Ichiji [27], also published in 2013, whose authors likewise report state-of-the-art performance. In addition, Seasonal ARIMA is added to the comparison, as most readers are familiar with this method.

Data Acquisition and Experimental Settings Time series of the abdominal displacement of 27 lung and liver cancer patients were collected with the Real-time Position Management™ (RPM) (Varian Inc., Santa Clara, CA) infrared camera and reflective marker block system during their PET/CT examinations. The time series serves as a respiratory motion surrogate [38].

The use of the data was approved by the appropriate Institutional Review Board in compliance with the Health Information Privacy and Portability Act [38].

The sampling rate of the respiratory traces was 30 Hz. The duration of data collection is from 15 to 45 minutes. The respiratory motion traces of the 27 patients demonstrate very high individuality.

Of the data, 60% is used for training, 20% for validation, and the remainder for testing.

TVSAR and wLMS do not need training; their prediction starts directly on the testing set.

For the experiments with RPKM and RPKS, the threshold for obtaining the best neighbors is set to 0.95.

For RPKS and SVRpred, we consider 2^{−12} , 2^{−11} , . . . , 2^{12} for the kernel parameter γ; 0, 0.01, 0.02, . . . , 0.1 for the insensitive zone ε; and max(|ȳ + 3σy |, |ȳ − 3σy |) for the regularization parameter C.

Prediction Performance of RPKM, RPKS and the Latest State-of-the-Art Methods  Table 3.2 shows the prediction performance of RPKM, RPKS, res TVSAR, wLMS, SVRpred and SARIMA, and Figure 3.20 shows box plots of the prediction performance of the proposed methods and the current state-of-the-art methods. Even though we consider only three ratios and the ratio is fixed for both windows, a small improvement from using adaptive windows is still visible; we expect that optimizing the window size for each individual patient would further improve performance.

Among the state-of-the-art methods, wLMS performs very well in short-term prediction, while res TVSAR outperforms wLMS for long-term prediction. Except for SVRpred, all other methods perform better than SARIMA.

Finally, it is evident that RPKM and RPKS significantly outperform all other methods, and the results show that RPKS is slightly better than RPKM.

Table 3.2: The prediction performance metrics (mean and standard deviation of R-squares) of the proposed methods and the state-of-the-art respiratory motion prediction methods on 27 patients
Prediction horizon 1 5 10 15 20 25 30
RPKM mean 0.998 0.976 0.918 0.831 0.728 0.620 0.523
std 0.001 0.018 0.052 0.095 0.141 0.179 0.206
RPKS mean 0.998 0.978 0.920 0.836 0.732 0.624 0.523
std 0.002 0.018 0.053 0.093 0.132 0.167 0.196
res TVSAR mean 0.964 0.834 0.684 0.462 0.229 0.013 -0.146
std 0.088 0.378 0.393 0.436 0.487 0.462 0.454
wLMS mean 0.996 0.880 0.648 0.386 0.131 -0.083 -0.233
std 0.005 0.322 0.487 0.527 0.535 0.526 0.520
SVRpred mean 0.908 0.738 0.639 0.347 0.029 -0.154 -0.323
std 0.044 0.075 0.099 0.164 0.324 0.323 0.359
SARIMA mean 0.979 0.846 0.608 0.231 -0.053 -0.292 -0.414
std 0.019 0.127 0.281 0.469 0.466 0.475 0.479

Figure 3.20: Prediction performance of RPKM, RPKS and the state-of-the-art methods for prediction horizons h = 1 to h = 30. Panels (a)-(g) show box plots for h = 1 (close view), 5, 10, 15, 20, 25 and 30.

Prediction Performance of RPKM With and Without Adaptive Ratio  Table 3.3 shows the prediction performance of RPKM with and without the adaptive ratio, and Figure 3.21 shows the corresponding box plots. Based on the experimental results, the adaptive window enhances the prediction accuracy of RPKM.

Table 3.3: The prediction performance metrics, mean and standard deviation of R-
squares, of the proposed approaches with and without adaptive ratio on 27 patients
Prediction horizon 1 5 10 15 20 25 30
RPKM mean 0.998 0.976 0.918 0.831 0.728 0.620 0.523
std 0.001 0.018 0.052 0.095 0.141 0.179 0.206
RPKM(without adaptive ratio) mean 0.998 0.976 0.916 0.827 0.721 0.612 0.517
std 0.001 0.019 0.054 0.096 0.142 0.180 0.206

3.4 Pattern-Based Variant-Best-Neighbors Prediction Using Orthogonal-Polynomial-Approximated Respiratory Motion Time Series

Directly using raw data for pattern matching works as long as the signal is clean, with little noise. However, the quality of medical devices varies from one device to another, and some systems produce more noise than others. It is therefore desirable to find a robust method that can cope with noisier data and deliver consistent performance.

Besides, using raw data to build the pattern libraries consumes a lot of space: the higher the sampling rate, the finer the signal, but also the larger the library. Sparseness is a popular topic in data mining; using a reduced representation of the original signal usually speeds up computation and makes the system more scalable.

The method that uses orthogonal polynomial approximation for the pattern-based variant-best-neighbors time series prediction is named OPPRED. It follows the same structure as RPKM, as shown in Figure 3.22, except that the data are converted into OP approximations.
Figure 3.21: Prediction performance of RPKM and RPKM (without adaptive ratio) for prediction horizons h = 1 to h = 30. Panels (a)-(g) show box plots for h = 1, 5, 10, 15, 20, 25 and 30.

Figure 3.22: The general approach of the proposed pattern-based variant-best-neighbors prediction using orthogonal-polynomial-approximated respiratory motion time series

3.4.1 Orthogonal Polynomials Approximation

Fuchs et al. [39] proposed a method for online segmentation of time series based on least-squares approximations with Legendre orthogonal polynomials (OPs). Although the method was originally intended for time series segmentation, it exhibits properties that are well suited to time series pattern matching.

A time series consisting of real-valued samples y_t, t = 0, \dots, N, with sampling rate s can be modeled by a parameterized function f(x): \mathbb{R} \to \mathbb{R}. Here, we assume

that f (x) is linearly dependent on a parameter vector w with elements wk ∈ R(k =

0, . . . , K). Note that we do not claim that f (x) is a linear function in x. More
concretely, we assume that f is a linear combination of K + 1 (linear or nonlinear)

so-called basis functions fk :


f(x) = \sum_{k=0}^{K} w_k \cdot f_k(x)          (3.32)

These basis functions may be polynomials, wavelets, sigmoid functions, or si-

nusoidal functions, for instance.

We may write the values of the K + 1 basis functions for the N + 1 points in time x_0, \dots, x_N into a matrix

F = \begin{pmatrix} f_0(x_0) & \cdots & f_K(x_0) \\ \vdots & \ddots & \vdots \\ f_0(x_N) & \cdots & f_K(x_N) \end{pmatrix}          (3.33)

If we combine the N + 1 samples of the overall time series into a vector y with

elements y_n, the linear least-squares problem we want to solve can be denoted by

\min_{w} \| F w - y \|          (3.34)

with \|\cdot\| being the Euclidean norm. Its solution w_{LS} can be found by setting the derivative with respect to w to zero. First,

\|Fw - y\|^2 = \langle Fw - y \,|\, Fw - y \rangle = w^T F^T F w - 2 y^T F w + y^T y          (3.35)

with \langle\cdot|\cdot\rangle being the standard inner product in a real-valued vector space. Then,

\frac{\partial \|Fw - y\|^2}{\partial w} = 2 F^T F w - 2 F^T y          (3.36)

leads to the least-squares solution

w_{LS} = (F^T F)^{-1} F^T y          (3.37)

provided that the matrix F^T F is regular. Using the pseudo-inverse F^+ = (F^T F)^{-1} F^T of F, we can then write

w_{LS} = F^+ y          (3.38)

In general, a real-valued pseudo-inverse A+ of a matrix A has the following

two properties (two of the four so-called Penrose criteria). First, (AA+ )T = AA+ .

Second, AA+ A = A and, consequently, AA+ AA+ = AA+ . Thus, the residuum

resulting from this least-squares approximation is

r_{LS} = \|F w_{LS} - y\|^2
       = y^T (F F^+)^T F F^+ y - 2 y^T (F F^+) y + y^T y          (3.39)
       = y^T y - w_{LS}^T F^T F w_{LS}

where F^T y = (F^T F) F^+ y = (F^T F) w_{LS} and (F F^+)^T F F^+ = F F^+. With the term (average) squared error, we refer to the residuum divided by the number of observed samples:
\sigma_{LS}^2 = \frac{1}{N+1} \left( y^T y - w_{LS}^T F^T F w_{LS} \right)          (3.40)
With (3.40), we can determine the squared error once the least-squares solution for w has been obtained.

In general, the solution of a linear least-squares problem is found by conducting

a QR decomposition or a singular value decomposition (SVD) of F.

Now, assume that the selected K + 1 basis functions are orthogonal with respect to an inner product, i.e., \sum_{n=0}^{N} f_{k_1}(x_n) \cdot f_{k_2}(x_n) = 0 for any two basis functions f_{k_1} and f_{k_2} with k_1 \ne k_2. This is the case for special kinds of polynomials

(see Section 3.2), for wavelet families, or the sinusoidal functions used for discrete

Fourier transforms, for instance. Then,

F^T F = \begin{pmatrix} f_0(x_0) & \cdots & f_0(x_N) \\ \vdots & \ddots & \vdots \\ f_K(x_0) & \cdots & f_K(x_N) \end{pmatrix} \begin{pmatrix} f_0(x_0) & \cdots & f_K(x_0) \\ \vdots & \ddots & \vdots \\ f_0(x_N) & \cdots & f_K(x_N) \end{pmatrix} = \begin{pmatrix} \|f_0\|^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \|f_K\|^2 \end{pmatrix}          (3.41)

That is, F^T F is a diagonal matrix, which can be inverted if the elements on the diagonal, i.e., the squared norms of the basis functions, are nonzero. This can easily be guaranteed by an appropriate choice of basis functions. From (3.37), we then get

w_{LS} = (F^T F)^{-1} F^T y
       = \begin{pmatrix} \frac{f_0(x_0)}{\|f_0\|^2} & \cdots & \frac{f_0(x_N)}{\|f_0\|^2} \\ \vdots & \ddots & \vdots \\ \frac{f_K(x_0)}{\|f_K\|^2} & \cdots & \frac{f_K(x_N)}{\|f_K\|^2} \end{pmatrix} \begin{pmatrix} y_0 \\ \vdots \\ y_N \end{pmatrix}          (3.42)
       = \begin{pmatrix} \sum_{n=0}^{N} \frac{y_n}{\|f_0\|^2} f_0(x_n) \\ \vdots \\ \sum_{n=0}^{N} \frac{y_n}{\|f_K\|^2} f_K(x_n) \end{pmatrix}

That is, the least-squares solution can be written as a linear combination of the

training samples (cf. the dual representations of classifiers which are common in the

field of support vector machines, for instance).

With this result for w_{LS}, with Equation (3.40), and with the definition w_k = \sum_{n=0}^{N} \frac{y_n}{\|f_k\|^2} f_k(x_n) for k = 0, \dots, K (elements of the solution vector w_{LS}), the squared error \sigma_{LS}^2 now becomes

\sigma_{LS}^2 = \frac{1}{N+1} \left( \sum_{n=0}^{N} y_n^2 - \sum_{k=0}^{K} w_k^2 \|f_k\|^2 \right)          (3.43)

Assume that, in a time window of length L + 1, the values y_0, y_1, \dots, y_L, measured at equidistant points in time x_0, x_1, \dots, x_L, must be approximated by a polynomial p of degree K \le L (L \in \mathbb{N}_0, K \in \mathbb{N}_0) in the least-squares sense.

It is well known that orthogonal polynomials with leading coefficient 1 in the

vector space P(R, R) of real polynomials on R fulfill the following three-term recur-

rence relation:

p_{-1}(x) = 0,          (3.44)
p_0(x) = 1,          (3.45)
p_{k+1}(x) = (x - a_k)\, p_k(x) - b_k\, p_{k-1}(x)          (3.46)

For the sliding window method, the approximation window [x_0, x_1, \dots, x_L] is located at [0, 1, \dots, L]. In our study, we use Legendre orthogonal polynomials, as shown in Figure 3.23, which fulfill the three-term recurrence relation with

a_k = \frac{L}{2},          (3.47)
b_k = \frac{k^2 \left( (L+1)^2 - k^2 \right)}{4 (4k^2 - 1)}          (3.48)
This provides a fast update procedure for generating the pattern libraries.
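The following is a minimal sketch of this procedure in NumPy, assuming a monic discrete-Legendre (Gram) basis on the grid x = 0, 1, ..., L generated by the recurrence (3.44)-(3.48), with the least-squares coefficients following the closed form (3.42). The function names are illustrative, not the implementation used in this study.

    import numpy as np

    def gram_basis(L, K):
        """Basis matrix F with entries f_k(x_n) via the three-term recurrence."""
        x = np.arange(L + 1, dtype=float)
        F = np.zeros((L + 1, K + 1))
        F[:, 0] = 1.0                              # p_0(x) = 1
        if K >= 1:
            F[:, 1] = x - L / 2.0                  # p_1(x) = (x - a_0) p_0(x)
        for k in range(1, K):
            b_k = k**2 * ((L + 1)**2 - k**2) / (4.0 * (4 * k**2 - 1))
            F[:, k + 1] = (x - L / 2.0) * F[:, k] - b_k * F[:, k - 1]
        return F

    def op_coefficients(y, K):
        """Least-squares coefficients w_k = sum_n y_n f_k(x_n) / ||f_k||^2, eq. (3.42)."""
        F = gram_basis(len(y) - 1, K)
        return (F.T @ y) / np.sum(F * F, axis=0)

    y = np.sin(np.linspace(0, 2 * np.pi, 180))     # one synthetic breathing cycle
    w = op_coefficients(y, K=20)                   # 21 coefficients
    y_hat = gram_basis(len(y) - 1, 20) @ w         # reconstruction of the segment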

Due to the orthogonality of the OPs, their coefficients are independent of each other. Referring to equation (3.42), the coefficient of each OP does not change as the approximation order increases; conversely, when the order decreases, the coefficients of the OPs whose order exceeds the new maximum simply become zero.

Figure 3.23: Legendre Polynomials

In other words, only one approximation has to be computed to obtain all approximations of equal or lower order. For instance, once we obtain the approximation of order 20, we also obtain the approximations of order 19, 18, and so on.

This property is important to our application of time series pattern approximation because a higher approximation order does not necessarily give the best approximation. One example is shown in Figure 3.24, and its coefficients are listed in Table 3.4.

Table 3.4: The coefficients of orthogonal polynomials up to order 20


w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10
0.267353 -4.71E-05 8.09E-06 5.60E-08 5.81E-10 6.94E-13 -2.07E-14 -7.45E-16 -8.39E-18 -9.75E-20 -3.60E-22
w11 w12 w13 w14 w15 w16 w17 w18 w19 w20
2.59E-24 1.51E-25 1.77E-27 1.28E-29 -5.73E-32 -1.90E-33 -2.98E-35 -2.14E-37 -1.33E-39 -5.04E-41

Figure 3.24: An example of OP approximation in which the lower-order approximation (order 18) is better than the higher-order one (order 20)

The Advantages of Using Orthogonal Polynomials Approximations  The advantages of using orthogonal polynomial approximations can be summarized as follows:

Efficiently determining the best order of OP  A higher order does not necessarily give a better approximation. With orthogonal polynomial approximation, performing a single approximation provides all approximations of lower orders, which makes determining the best order efficient. In contrast, a traditional approximation must be recomputed for every candidate order.

Sparse data representation  The sampling rate of our data is 30 Hz and a typical respiratory cycle takes about 6 seconds, so there can be about 180 samples in one cycle. Using an orthogonal polynomial approximation of order 20, only 21 coefficients need to be recorded.

Accurate approximation  For about 1 to 2 cycles of respiratory motion, an order-20 approximation usually achieves a very good fit, with R-squares typically higher than 0.99.

Fast updating and reconstruction  When the window length N is fixed, the orthogonal polynomials are fixed and can easily be calculated using the three-term recurrence relation with coefficients (3.47)-(3.48). Due to the orthogonality of the OPs, we also obtain a closed-form equation for the coefficients, shown in equation (3.42).

Signal smoothing  The least-squares error is used for approximation. Since the model usually cannot fit the time series perfectly, trade-offs are made during approximation, and outliers tend to be smoothed out.

Readiness for clustering and classification using coefficients  The resulting coefficients represent the weights of the orthogonal polynomials in the approximation; they can potentially be used as features for clustering and classification.

Distance Measure of Patterns Using the Coefficients of Orthogonal Poly-

nomials Approximation In this study, we still use R-squares as the similarity

metric of two patterns.

S_n = 1 - \frac{SSE}{SS_{tot}}          (3.49)

while the sum of squared errors (SSE) is

\sum_{i=1}^{n} |e_i|^2 = \sum_{i=1}^{n} \left| \sum_{k=1}^{K} \Delta w_k \left( f_k(i) - f_k(n) \right) \right|^2          (3.50)

By expansion, the SSE can be written in closed form:

SSE(w) = c_{11} \Delta w_1^2 + c_{12} \Delta w_1 \Delta w_2 + \cdots + c_{KK} \Delta w_K^2          (3.51)

where the c_{jk}, j, k \in \{1, \dots, K\}, are constants when the orthogonal polynomials f_k and n are fixed. Therefore, the similarity metric S_n(w) depends only on the orthogonal polynomial coefficients w:

S_n(w) = 1 - \frac{SSE(w)}{SS_{tot}}          (3.52)
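As a minimal sketch (under the assumption that both patterns were approximated on the same window with a basis matrix F, such as the one produced by the gram_basis helper sketched earlier, all hypothetical names), the similarity can be evaluated directly from the coefficient difference without touching the raw samples:

    import numpy as np

    def coefficient_similarity(w_a, w_b, F, y_ref):
        """R-squares similarity (3.52) computed from OP coefficients alone."""
        delta_w = w_a - w_b
        e = F @ delta_w                      # pointwise difference of the two fits
        sse = float(e @ e)                   # the quadratic form of (3.51)
        ss_tot = float(np.sum((y_ref - y_ref.mean()) ** 2))
        return 1.0 - sse / ss_tot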
3.4.2 Prediction Results of RPKM and OPPRED

Table 3.5 shows that the performances of RPKM and OPPRED are very close to each other, with RPKM just slightly better than OPPRED.

Table 3.5: The prediction performance of RPKM and OPPRED on 27 patients.


Prediction horizon 1 5 10 15 20 25 30
RPKM mean 0.998 0.976 0.918 0.831 0.728 0.620 0.523
std 0.001 0.018 0.052 0.095 0.141 0.179 0.206
OPPRED mean 0.998 0.975 0.914 0.824 0.720 0.612 0.516
std 0.001 0.019 0.056 0.100 0.143 0.179 0.204
res TVSAR mean 0.964 0.834 0.684 0.462 0.229 0.013 -0.146
std 0.088 0.378 0.393 0.436 0.487 0.462 0.454
wLMS mean 0.996 0.880 0.648 0.386 0.131 -0.083 -0.233
std 0.005 0.322 0.487 0.527 0.535 0.526 0.520
SVRpred mean 0.908 0.738 0.639 0.347 0.029 -0.154 -0.323
std 0.044 0.075 0.099 0.164 0.324 0.323 0.359
SARIMA mean 0.979 0.846 0.608 0.231 -0.053 -0.292 -0.414
std 0.019 0.127 0.281 0.469 0.466 0.475 0.479

As shown in Figures 3.25b to 3.25h, RPKM and OPPRED significantly outperform all other methods at all prediction horizons. With the exception of SVRpred, seasonal ARIMA performs worse than the other methods, which are dedicated to the respiratory motion time series prediction problem.

Figure 3.25: Prediction performance of RPKM, OPPRED, res TVSAR, wLMS, SVRpred and SARIMA. Panels (a)-(h) show box plots for prediction horizons h = 1 (with a close view), 5, 10, 15, 20, 25 and 30.

Two Examples of Prediction of the Proposed Methods  Figures 3.26 and 3.27 show two prediction examples that visually demonstrate how well the methods perform. The solid blue line represents the observations, the red line OPPRED, and the black dotted line RPKM.

Figure 3.26: Prediction results of Patient 9 with h = 15

Figure 3.27: Prediction results of Patient 2 with h = 15

Weighted-Pattern-Based Variant-Best-Neighbors Prediction Using Weighted Orthogonal-Polynomial-Approximated Respiratory Motion Time Series

The relative importance of different parts of a time series segment varies. Figure 3.28 shows an example: without weights, the two approximation errors appear to be the same, so we propose weighted orthogonal polynomial pattern matching to distinguish them.

On the local scale of respiratory motion time series, the latest data are more important than the older data; recall, for example, the correlation of errors before and after time t_\lambda of the best neighbors. Referring back to Figure 3.15, the error close to and before the reference time is correlated with the error after the reference time. Since we desire this kind of flexibility in our respiratory motion prediction problem, we introduce weights on the errors of the time series.

To improve our algorithm, we propose adding weights to the errors of the polynomial approximation, the pattern matching, or both.

Figure 3.28: This example shows that even though two time series have the same total error, the locations of the errors can be very different. The upper plot shows two patterns that match very well in the older data (left) but not in the newest data; for prediction, we would therefore prefer the lower one.

Many distance functions have been developed throughout the history of time series research. The L_p-norm, a common family of distance measures, has the following definition.

Definition 3.1 (L_p-norm): Given two time series R and S of the same length N, the L_p-norm distance between R and S is

L_p\text{-norm}(R, S) = \sqrt[p]{\sum_{i=1}^{N} (r_i - s_i)^p}          (3.53)

Even R-squares can be seen as a variant of the Euclidean distance (p = 2) when comparing against a fixed reference time series, which in our method is the current pattern.
Thus, our similarity metric, equation (3.52), can be generalized by adding weights, as below.
Definition 3.2: The weighted L_p-norm is defined as

L_p\text{-norm}(R, S, W) = \sqrt[p]{\sum_{i=1}^{N} w_i (r_i - s_i)^p}          (3.54)

where w_i is the weight for the distance of the pair of the i-th samples.

3.4.3 Weighted Orthogonal Polynomials Approximations

The conventional orthogonal polynomial approximation considers all points of a time series to be equally important, and the regression approximates the whole series with minimum overall error. However, flexibility can be introduced into the orthogonal polynomial approximation. In our study, we desire a more accurate approximation of the most recent data than of the older data. To achieve this, weights b are applied to the approximation error during regression. Equation (3.35) then becomes:

(Fw - y)^T b (Fw - y) = (Fw - y)^T (bFw - by) = w^T F^T b F w - 2 y^T b F w + y^T b y          (3.55)

Then,

\frac{\partial (Fw - y)^T b (Fw - y)}{\partial w} = 2 F^T b F w - 2 F^T b y          (3.56)

Setting 2 F^T b F w - 2 F^T b y = 0, we have

2 F^T b F w = 2 F^T b y          (3.57)

The coefficients w_{LS} can then be written as:

w_{LS} = (F^T b F)^{-1} F^T b y = \begin{pmatrix} \sum_{n=0}^{N} \frac{y_n b_n}{\|f_0\|^2} f_0(x_n) \\ \vdots \\ \sum_{n=0}^{N} \frac{y_n b_n}{\|f_K\|^2} f_K(x_n) \end{pmatrix}          (3.58)

Under the current framework, the best neighbors are found by a two-step pattern search. An alternative is to assign weights to the two windows and combine their similarity values with those weights. This variant can give results very similar to the proposed multi-step pattern search, and it is more mathematically integrated.

It is desirable to keep most of the data accurately approximated and to relax only the oldest data. Figure 3.29 illustrates the weights considered in our study. During validation, we can try multiple sets of weights and select the one giving the best performance.
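A minimal sketch of the weighted fit follows, assuming the basis matrix F from the earlier gram_basis sketch: since a general weight profile breaks the orthogonality of the basis, the sketch solves the weighted normal equations behind (3.58) directly, and the ramp-shaped weight is only one illustrative choice of the profile in Figure 3.29.

    import numpy as np

    def weighted_op_coefficients(y, F, b):
        """Solve (F^T b F) w = F^T b y for the weighted coefficients."""
        B = np.diag(b)
        return np.linalg.solve(F.T @ B @ F, F.T @ B @ y)

    L = 99                                            # window of L + 1 samples
    b = np.concatenate([np.linspace(0.2, 1.0, 30),    # relax only the oldest data
                        np.ones(L + 1 - 30)])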

3.4.4 Weighted Time Series Pattern Matching

During pattern matching of time series, we may want to emphasize the importance of some parts of the series. Therefore, we propose a weighted time series pattern matching. Similar work has been done by Jeong [40], who proposed weighted dynamic time warping (WDTW); dynamic time warping (DTW) is a distance measure for time series. Similarly, for weighted time series pattern matching, weights are added to the time series to penalize dissimilarity in different parts of the pattern, achieving more flexibility in pattern matching.
Figure 3.29: The weights of the shorter window (black dotted) and the longer window (red dotted)

In respiratory motion time series prediction, the latest data are intuitively more important than the older data.

To implement weights in pattern matching, the computation of the similarity of time series segments, i.e., equation (3.49), is modified as

S_n = 1 - \frac{\sum_{i=1}^{L_w} \left( (u_n(i) - u_0(i))\, w_i \right)^2}{\sum_{i=1}^{L_w} \left( (u_0(i) - \bar{u}_0)\, w_i \right)^2}          (3.59)
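A minimal sketch of (3.59), where u0 is the current pattern, un a candidate neighbor, and w a weight profile of the same length (all hypothetical names):

    import numpy as np

    def weighted_similarity(u0, un, w):
        """Weighted R-squares similarity of two equal-length segments, eq. (3.59)."""
        sse = np.sum(((un - u0) * w) ** 2)
        ss_tot = np.sum(((u0 - u0.mean()) * w) ** 2)
        return 1.0 - sse / ss_tot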

A Simulation Study on Noise-Added Time Series Data  A simulation study is conducted to validate whether OPPRED is robust to noisy data. In this study, artificially generated noise similar to that in Figure 3.31 is added to the real respiratory motion data of the first 4 of the 27 patients. One setting generates short, sporadic noise, while the other generates relatively longer and sparser noise, as shown in Figure 3.31.

Table 3.6 shows the mean and standard deviation of the prediction performance (R-squares) of RPKM and OPPRED on the noise-added time series data, with prediction horizons ranging from 1 to 30. It shows that when the time series is noisy, OPPRED performs slightly better than RPKM. Together with its sparse data representation, this makes OPPRED a suitable algorithm for respiratory motion time series prediction.

Table 3.6: The prediction performance metrics, mean and standard deviation of R-
squares, of the proposed approaches on first 4 patients noise-added respiratory motion
time series.
Prediction horizon 1 5 10 15 20 25 30
RPKM(noise) mean 0.975 0.946 0.876 0.783 0.674 0.565 0.479
std 0.010 0.025 0.062 0.104 0.154 0.191 0.216
OPPRED(noise) mean 0.973 0.947 0.879 0.791 0.686 0.578 0.487
std 0.011 0.026 0.061 0.101 0.146 0.180 0.206

3.5 Discussion and Conclusion

In this study, we developed a pattern-matching-based semi-periodic time series prediction framework and applied it to respiratory motion time series prediction. In radiotherapy, system latencies need to be compensated for accurate irradiation during treatment; accurate respiratory motion prediction can minimize damage to normal body tissues and vital organs.

Pattern matching can effectively utilize the existing information in the data: similar patterns demonstrate similar trends in the response. The pattern recognition process is enhanced by combining it with statistical and feature analysis, which help obtain better-matched patterns and remove undesired ones.
Figure 3.30: Prediction performance of RPKM and OPPRED on noise-added data. Panels (a)-(g) show box plots for prediction horizons h = 1, 5, 10, 15, 20, 25 and 30.

Figure 3.31: An example of a noise-added time series in the simulation study. Simulated noise is added to the respiratory time series data of a patient.

The experimental results show that the prediction of the proposed method is very accurate and robust across different kinds of patients. For h = 5, most patients attain an R-squares of 0.95. This method should contribute greatly to tumor position prediction and thereby help cancer patients improve their quality of life.

Looking at the autocorrelation of height and interval, we know that respiratory motion time series are autoregressive, so the values in the next cycle may be determined by previous cycles. This theoretically supports our method, which uses similar patterns to predict future values.

The simulation study on noise-added data shows that OPPRED is more robust to noise and drifting than RPKM. Since the proposed method is designed to predict respiratory motion time series, which demonstrate tremendous individuality, it also shows potential for other applications that exhibit the characteristics of semi-periodic time series.

3.5.1 Future Studies

In the future, improvements are possible in the following aspects.

Finding Better Pattern Matching Methods  The current two-window design is intended to consider both short-range and long-range patterns of the time series. In the future, we will conduct more experiments on weighted pattern matching and weighted orthogonal approximation, which show great potential to improve our methods and are discussed in detail in Sections 3.4.3 and 3.4.4.

Finding a Better Distance Measure  Currently, R-squares is used to measure the similarity of time series. It provides a convenient control mechanism for time series similarity by providing a universal measurement. However, R-squares is sensitive to the window length and is not perfect; in the future, we may develop a better similarity measure.

Generalizing the current 2-window design to an n-window design  The 2-window design can in fact be seen as a special case of an n-window design with zero weights on the other windows. Theoretically, the windows can be infinitely long; however, the patterns found by longer windows are obviously less important than those found by shorter ones. For instance, in respiratory motion, we already know that half-cycle patterns are more predictive than one-cycle patterns.

Optimizing Parameters  Intuitively, not every part of a time series segment is equally important for obtaining the most effective approximations or the best neighbors of time series patterns.

Generally, our method can be improved in four ways: the first is to improve the pattern matching process; the second is to improve how the obtained best-matching patterns are used for prediction; the third is to optimize the weights in pattern matching and orthogonal approximation; and the fourth is to optimize the remaining parameters of the algorithm, such as the sizes of the two windows and the parameters of the SVR.

CHAPTER 4

Pattern Recognition and Classification of Multivariate Time Series Signals: EEG

Study of Musicians and Non-Musicians

4.1 Introduction

There has been much interest in the beneficial effects of musical training on

cognition. Previous studies have indicated that musical training was related to better

working memory and that these behavioral differences were associated with differ-

ences in neural activity in the brain. However, it was not clear whether musical

training impacts memory in general, beyond working memory. By recruiting pro-

fessional musicians with extensive training, we investigated if musical training has

a broad impact on memory with corresponding electroencephalography (EEG) sig-

nal changes, by using working memory and long-term memory tasks with verbal and

pictorial items. Behaviorally, musicians outperformed non-musicians on both working memory and long-term memory tasks. A comprehensive EEG pattern study has been performed,

including various univariate and multivariate features, time-frequency (wavelet) anal-

ysis, power-spectra analysis, and deterministic chaotic theory. The advanced feature

selection approaches have also been employed to select the most discriminative EEG

and brain activation features between musicians and non-musicians. High classifica-

tion accuracy (more than 95%) in memory judgments was achieved using Proximal

Support Vector Machine (PSVM). For working memory, it showed significant differ-

ences between musicians versus non-musicians during the delay period. For long-term

memory, significant differences on EEG patterns between groups were found both in

the pre-stimulus period and the post-stimulus period on recognition. These results

indicate that musicians' memory advantage occurs in both working memory and long-term memory, and that the developed computational framework using advanced data mining techniques can be successfully applied to classify complex human cognition with high temporal resolution.

4.2 Methodology

4.2.1 Data Acquisition and Experimental Settings

Participants  36 participants were initially recruited for the study. Four participants were excluded for having negative d′ values on the long-term memory task, two were excluded for failing to follow directions, and one was excluded as a behavioral outlier (more than 3 SDs from the mean long-term memory performance). In total, 29

subjects were included in the analyses; 14 (5 female) were professional musicians with

>10 years of experience (M = 22.9 years of experience) and 15 were nonmusicians

(8 female) with no musical training. Informed consent was obtained from all partic-

ipants in accordance with the experimental protocol approved by the University of

Texas Institutional Review Board.

Design of the Experiments Participants completed a study session followed by

a test session involving words and pictures as stimuli. Stimuli were presented visually

on a computer and all responses were made using the keyboard. During the study

session, participants were presented with pairs of stimuli, one at a time. Each study

trial began with a fixation cross (250 ms), the first stimulus (1000 ms), a blank screen

(5000 ms), the second stimulus (2500 ms or until a response), and finally a blank

screen (1000 ms). Upon presentation of the second stimulus, participants made a

judgment of whether the second stimulus was the same as the first (Figure 4.1a).

A few minutes following the study session, participants' memory was tested. During this test session, stimuli presented during study were presented again along with new stimuli that had not been studied. Further, we only tested participants' memory on stimuli that had been presented once. Therefore, only stimuli presented on trials that were different during the study session (i.e., trials on which the second stimulus was different from the first) were presented during test. Each test trial began with a fixation (250 ms), followed by a stimulus (3000 ms or until a response), and then a blank screen (1250 ms). Upon presentation of the stimulus, participants made a memory judgment which included a rating of how confident they were in their memory (Figure 4.1b). They were allowed to make three responses: remember with low confidence, remember with high confidence, or new.

Word and picture stimuli were blocked for both study and test phases, such

that each participant was presented with a block of word trials followed by a block of

picture trials (or vice versa). Whether or not participants were presented with words

or pictures first was randomly determined for each participant.

Types of Stimuli Participants were presented with pictures of complex scenes and

words. During the study session, participants completed 96 trials of pictures (32

same, 64 different) and 96 trials of words (32 same, 64 different). Given that each

trial contained two stimulus presentations, participants studied a total of 128 pictures

and 128 words from different trials. These stimuli were used to test long-term memory

during the test session. During the long-term memory task, participants completed

192 trials of pictures (128 studied, 64 new) and 192 trials of words (128 studied, 64

new).

Figure 4.1: Schematic of experimental paradigm. (A1 to A5) During the study period, participants were asked to judge whether the second stimulus matched the first. (B1 to B3) During the test period, participants made memory judgments to stimuli while rating their confidence. Low represents remember with low confidence, High represents remember with high confidence, and New represents a judgment where participants thought the stimulus was not studied.

EEG data EEG data were collected during both study and test sessions using

the Brain Vision ActiChamp 32 channel system and recorded using the Pycorder

software. Electrode positions followed the 10-20 system and included Fz, Cz, Pz, Oz,

Fp1, Fp2, F3, F4, F7, F8, Fc1, Fc2, Fc5, Fc6, Ft9, Ft10, T7, T8, C3, C4, Cp1, Cp2,

Cp5, Cp6, Tp9, Tp10, P3, P4, P7, P8, O1, and O2 (Figure 4.2). During recording,

data were sampled at 1000 Hz and filtered between 0.01 and 100 Hz. Offline, data

were high-pass filtered with a 0.1 Hz Butterworth filter, downsampled to 256 Hz,

and referenced to the average of the mastoids (TP9 and TP10). Post-stimulus ERPs

with a 1000 ms duration were extracted and were baseline-corrected with respect to a

200 ms prestimulus baseline. Visual inspection was then used to remove epochs that

contained eye blinks and movement artifacts.

Figure 4.2: Map of the channel locations

4.2.2 Artifact Removal

Brain signals often contain significant artifacts that lead to major problems in signal analysis when the activity due to artifacts has a higher amplitude than that due to neural sources. Common sources of artifacts include eye movements, muscle contractions, and interference from electric devices [41]. Independent Component Analysis (ICA) has been successfully applied to artifact removal in many studies. The basic idea is to decompose the brain data into independent components, identify the artifact components using pattern and source localization analysis, and reconstruct the brain signals by excluding those components. However, linking components to artifact sources (e.g., eye blinking, muscle movements) remains largely user-dependent. In this study, we employed a recently developed automatic ICA-based algorithm, called ADJUST [42], for signal artifact removal. ADJUST applies stereotyped artifact-specific spatial and temporal features to identify independent components of artifacts automatically. These artifacts can be removed from the data without affecting the activity of neural sources [42]. The data analysis in the following is based on the 'cleaned' data after artifact removal.
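For illustration, the following is a minimal sketch of ICA-based artifact removal using the MNE-Python library (an assumed tooling choice; ADJUST itself is distributed as an EEGLAB plugin). The file name is hypothetical, and the excluded component indices would in practice come from an automatic criterion such as ADJUST's spatio-temporal features rather than being hard-coded.

    import mne

    raw = mne.io.read_raw_fif("subject01_raw.fif", preload=True)  # hypothetical file
    raw.filter(l_freq=1.0, h_freq=None)       # high-pass filtering helps the ICA fit

    ica = mne.preprocessing.ICA(n_components=20, random_state=0)
    ica.fit(raw)
    ica.exclude = [0, 3]                      # e.g., blink and eye-movement components
    raw_clean = ica.apply(raw.copy())         # reconstruct signals without the artifacts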

4.2.3 Signal Feature Extraction

We extensively investigated features from the collected physiological signals. Four groups of feature extraction techniques were employed to capture signal characteristics that may be relevant to assessing memory workload: signal power, statistical, morphological, and wavelet features. For a data epoch with n channels, we first extracted features from the signal at each channel and then concatenated the features of all n channels to construct the feature vector of the data epoch. Let X = \{x_1, x_2, \dots, x_m\} denote a single-channel signal with m points; the four groups of signal features are described as follows.


Figure 4.3: Artifact Removal Using ICA

Signal Power Features: Adopting the signal features used in previous work [43], we computed the signal power for each channel in every non-overlapping 2-Hz interval from 4 to 40 Hz. The 18 power features provide finer signal power spectrum information than the commonly used brain signal frequency bands, such as the theta, alpha, beta, and gamma bands.
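A minimal sketch of these band-power features using Welch's method from SciPy (an assumed implementation choice, not the study's code); x is one channel's epoch sampled at fs Hz.

    import numpy as np
    from scipy.signal import welch

    def band_powers(x, fs, lo=4, hi=40, width=2):
        """Power in each non-overlapping 2-Hz interval from 4 to 40 Hz."""
        freqs, psd = welch(x, fs=fs, nperseg=min(len(x), int(fs)))
        df = freqs[1] - freqs[0]
        feats = []
        for f0 in range(lo, hi, width):               # 4-6, 6-8, ..., 38-40 Hz
            mask = (freqs >= f0) & (freqs < f0 + width)
            feats.append(np.sum(psd[mask]) * df)      # integrate PSD over the band
        return np.array(feats)                        # 18 power features per channel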

Band Power Asymmetry: While it is well known that emotional states are associated with the power asymmetry of EEG signals, it is unknown whether power asymmetry can serve as an indicator distinguishing musicians from non-musicians. We studied power asymmetry in two ways, inter-hemispheric and intra-hemispheric asymmetry, as shown in the channel locations map in Figure 4.5 [44].

Statistical Features: We used the four most widely used statistical measures, mean, variance, skewness, and kurtosis, to characterize the distribution of signal amplitudes. In particular, the mean is the averaged signal amplitude, and the variance measures the signal variability around the mean. The higher-order statistic skewness quantifies the extent to which the distribution leans to one side of the mean, and kurtosis measures the 'peakedness' of the distribution.
Figure 4.4: Topographies for ICA-based artifact removal: (a) eye blink, (b) vertical eye movement, (c) horizontal eye movement, (d) generic discontinuity, (e) neural activity.

Figure 4.5: Groups of channels for inter- and intra-hemispheric power band asymmetry. For inter-hemispheric power band asymmetry, the value is calculated for pairs of the same color across hemispheres. For intra-hemispheric power band asymmetry, the value is calculated for pairs of different colors within the same hemisphere.

Morphological Features: Three morphological features were extracted to describe the morphological characteristics of a single-channel signal. These features proved useful in our previous studies of brain signals [45, 46]. A brief description of the morphological features is given in the following; a computational sketch follows the list.

• Curve Length: also known as ‘line length’ which was first proposed by Olsen et

al. [47]. Curve length is the sum of distances between successive points, given

by

\sum_{i=1}^{m-1} |x_{i+1} - x_i|.          (4.1)
Since curve length increases as the signal magnitude or frequency increases, it

is a measure of amplitude-frequency variations of a signal. It has been used in

many brain signal studies, such as epileptic seizure detection [48], stimulation

responses of the brain [49].

• Number of Peaks: a widely used characteristic to measure the overall frequency

of a signal. The number of peaks in a signal X can be calculated by


\frac{1}{2} \sum_{i=1}^{m-2} \max\{0, \mathrm{sgn}(x_{i+2} - x_{i+1}) - \mathrm{sgn}(x_{i+1} - x_i)\}.          (4.2)

• Average Nonlinear Energy: nonlinear energy was first proposed by Kaiser [50].

It has been found that the nonlinear energy is sensitive to spectral changes.

Thus, it is useful to capture spectral information of a signal [51]. The average

nonlinear energy of the single-channel signal X is computed as


\frac{1}{m-2} \sum_{i=2}^{m-1} \left( x_i^2 - x_{i-1} x_{i+1} \right).          (4.3)

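The sketch below implements the three morphological features (4.1)-(4.3) for a single-channel signal x stored as a NumPy array (illustrative code, not the study's implementation).

    import numpy as np

    def curve_length(x):
        return np.sum(np.abs(np.diff(x)))                      # eq. (4.1)

    def number_of_peaks(x):
        s = np.sign(np.diff(x))                                # sgn(x_{i+1} - x_i)
        return 0.5 * np.sum(np.maximum(0, s[1:] - s[:-1]))     # eq. (4.2)

    def avg_nonlinear_energy(x):
        return np.mean(x[1:-1] ** 2 - x[:-2] * x[2:])          # eq. (4.3)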
Time-Frequency Features: Wavelet transform (WT) is a powerful tool to

perform time-frequency analysis of signals. The fundamental idea of WT is to rep-

resent a signal by a linear combination of a set of functions obtained by shifting or


dilating a particular function called mother wavelet [52]. The WT of a signal X(t) is

defined as
t−b
Z
1
C(a, b) = X(t) √ Ψ( )dt (4.4)
R a a
where Ψ is the mother wavelet, C(a, b) are the WT coefficients of the signal X(t), a

is the scale parameter, and b is the shifting parameter. Continuous wavelet transform

(CWT) has a ∈ R+ and b ∈ R, and discrete wavelet transform (DWT) has a = 2j

and b = k2j for all (j, k) ∈ Z given the decomposition level of j. Since CWT

explores every possible scale a and shifting b, it is generally a lot more computationally

expensive than DWT. As a result, DWT is often used to perform time-frequency

analysis of a signal at different decomposition levels [53]. The DWT coefficients

provide a non-redundant and highly efficient representation of a signal in both time

and frequency domain. At each level of decomposition, DWT works as a set of band-

pass filters to divide a signal into two bands called approximations and details signals.

The approximations (A) are the low frequency components of the signal, and the

details (D) are the high-frequency components. Among different wavelet families, we

employed Daubechies wavelet as it is frequently used in physiological signal analysis

due to its orthogonality property and efficient filter implementation [54]. A 4-level

DWT decomposition was applied to the collected signals with the sampling rate of

128 Hz. Table 4.1 lists the decomposed signals A4, D4, D3, D2, D1, which roughly

corresponded to the commonly recognized brain signal frequency bands delta, theta,

alpha, beta, and gamma, respectively.

After the 4-level DWT decomposition, a set of wavelet coefficients were obtained

for each decomposed signals. To further decrease feature dimensionality, we employed

a popular measure called wavelet entropy (WE), which indicates the degree of multi-

frequency signal order/disorder in the signals [55]. To obtain wavelet entropy, the first

step is to calculate relative wavelet energy for each decomposition level as follows

p_j = \frac{E_j}{E_{tot}} = \frac{E_j}{\sum_{j=1}^{n} E_j},          (4.5)

where j is the resolution level, n is the number of resolution levels selected for analysis (n = 5 in this study), and E_j is the wavelet power at decomposition level j, calculated as the sum of the squared wavelet coefficients at that level. The relative wavelet energy p_j can be considered the power density of the decomposed signal at level j, and it satisfies \sum_{j=1}^{n} p_j = 1. Following the Shannon entropy [56] for analyzing and comparing probability distributions, the WE is defined similarly by
WE = -\sum_{j=1}^{n} p_j \ln(p_j),          (4.6)

where p_j is the relative wavelet energy at resolution level j. The wavelet entropy offers a suitable tool for characterizing the order/disorder of the signal power across the five brain signal frequency bands (delta, theta, alpha, beta and gamma). For example, if the relative wavelet energy at one resolution level (e.g., the alpha band) dominates the others, so that its p_i is almost one and all other relative wavelet energies are almost zero, the wavelet entropy will be very small, near zero. On the other hand, if the relative wavelet energies are almost equal across all resolution levels, the WE reaches its maximum value.
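A minimal sketch of the 4-level Daubechies DWT and the wavelet entropy (4.5)-(4.6), assuming the PyWavelets package and the db4 wavelet (the Daubechies order is an assumption, as the text does not specify it); x stands in for one channel's epoch.

    import numpy as np
    import pywt

    x = np.random.default_rng(0).normal(size=1280)    # placeholder single-channel epoch

    coeffs = pywt.wavedec(x, "db4", level=4)          # [A4, D4, D3, D2, D1]
    energies = np.array([np.sum(c ** 2) for c in coeffs])
    p = energies / energies.sum()                     # relative wavelet energy, eq. (4.5)
    wavelet_entropy = -np.sum(p * np.log(p + 1e-12))  # eq. (4.6), guarded against p = 0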

4.2.4 Feature Vector Classification Using Proximal Support Vector Machine (PSVM)

Classification Method In the experiments, we collected data from four difficulty

levels (0-, 1-, 2-, 3-back). A popular binary classification technique, support vector

machine (SVM), was employed to investigate the data separability at different mental

workload levels. SVM techniques have been successfully applied in many classification
Table 4.1: Frequency ranges and the corresponding brain signal frequency bands of
the five levels of signals by discrete wavelet decomposition.
Decomposed Level Frequency Range (Hz) Approximate Band
D1 32-64 Gamma
D2 16-32 Beta
D3 8-16 Alpha
D4 4-8 Theta
A4 0-4 Delta

problems [57, 58, 59, 60, 61]. The fundamental problem of SVM is to build an optimal

decision boundary to separate two categories of data. Let Y denote an n×k dimensional

feature vector for a multi-channel data session at certain difficulty level, where n is

the number of signal channels and k is the number of features of each channel. To

classify data with two workload levels, let l denote the sample class label and l = 1

denotes one workload level, and l = −1 means the other workload level.

Assume we have p sessions of level one denoted by S1 = {(Y1 , l1 ), (Y2 , l2 ), ..., (Yp , lp )},

and q sessions of level two denoted by S2 = {(Yp+1 , lp+1 ), (Yp+2 , lp+2 ), ..., (Yp+q , lp+q )}.

Each session is represented by a n × k dimensional feature vector. One can find

infinitely many hyperplanes in Rn×k to separate the two data groups. Based on

statistical learning theory (SLT), an SVM selects a hyperplane which maximizes its distance from the closest point among the samples. This distance is referred to as the mar-

gin. The standard SVM formulation that maximizes the margin and minimizes the

training error is as follows:

\min_{\omega, \xi, b} \left\{ \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{p+q} \xi_i \;:\; D(Y^T \omega + b e) \ge e - \xi \right\},          (4.7)

where ω is the weight vector, and the slack variables ξ are introduced to measure the

degree of misclassification during training. The penalty cost C is used to control the

tradeoff between a large margin and a small prediction error penalty. Each column

of Y is an observation Yi , D is a diagonal matrix with class-label elements Dii equal


to 1 if Yi belongs to one class, or -1 otherwise. The vector e has all its elements equal

to one. The first term of the objective function in (4.7) maximizes the margin of separation 2/\|\omega\|, and the second term measures how much emphasis is given to

the minimization of the training error.

Since standard SVM classifiers usually require a large amount of computation time for training, the Proximal SVM (PSVM) algorithm was introduced by Mangasarian and Wild [62] as a fast alternative to the standard SVM formulation. The formulation for the linear PSVM is as follows:

\min_{\omega, \xi, b} \left\{ \frac{1}{2}(\|\omega\|^2 + b^2) + \frac{1}{2} C\, \xi^T \xi \;:\; D(Y^T \omega + b e) = e - \xi \right\},          (4.8)

where the traditional SVM inequality constraint is replaced by an equality con-

straint. This modification changes the nature of the support hyperplanes (ω T Y + b =

±1). Instead of bounding planes, the hyperplanes of PSVM can be thought of as

‘proximal’ planes, around which the points of each class are clustered and which are

pushed as far apart as possible by the term (kωk2 +b2 ) in the above objective function.

It has been shown that PSVM has comparable classification performance to that of

standard SVM classifiers, but can be an order of magnitude faster [62]. Therefore,

we employed PSVM in this study.
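Because the constraint in (4.8) is an equality, ξ can be eliminated: substituting ξ = e - D(Aω + eb) into the objective and setting the gradient to zero yields the single linear system (I/C + EᵀE)z = Eᵀd with E = [A, e] and z = [ω; b]. The following is a minimal sketch of that closed form (names, shapes and the sign convention are assumptions, not the authors' code).

    import numpy as np

    def psvm_train(A, labels, C=1.0):
        """Linear PSVM: A is (n_samples, n_features); labels are in {-1, +1}."""
        n, m = A.shape
        E = np.hstack([A, np.ones((n, 1))])       # E = [A, e], z = [w; b]
        d = labels.astype(float)                  # D e = d for D = diag(labels)
        H = np.eye(m + 1) / C + E.T @ E
        z = np.linalg.solve(H, E.T @ d)           # (I/C + E^T E) z = E^T D e
        return z[:-1], z[-1]                      # weight vector w and offset b

    def psvm_predict(A, w, b):
        return np.sign(A @ w + b)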

Training and Evaluation A classification problem generally follows a two-step

procedure which consists of training and testing phases. During the training phase, a

classifier is trained to achieve the optimal separation for the training data set. Then

in the testing phase, the trained classifier is used to classify new samples with un-

known class information. The N-fold cross-validation is an attractive method of model

evaluation when the sample size is small. It is capable of providing an almost unbiased estimate of the generalization ability of a classifier [63]. For the 29 subjects, the total

numbers of data samples (trials) for sessions A and B are 128 and 386, respectively. We designed a 5-fold cross-validation method to train and evaluate the SVM classifier.

To explore the differences in the responses of musicians and non-musicians under various events, we separated the data into 5 and 3 epochs for sessions A and B, respectively, as shown in Figure 4.1. Based on the event markers of the EEG data, we further defined 23 conditions for each session. The following table lists all of the conditions.

Table 4.2: A list of all comparison conditions of the experiments. For comparison conditions 4 to 11, the naming structure is stimulus/ground truth/response. For conditions 12 to 23, the naming structure is stimulus/response to that stimulus in the test session/whether it was the 1st or 2nd stimulus. For conditions 35 to 46, it is stimulus/confidence level of having seen the stimulus/correctness.
Group A Group B
condition event condition event
1 all samples 24 all samples
2 picture 25 picture
3 word 26 word
4 picture - same - same 27 picture - long term Low
5 picture - same - diff 28 picture - long term High
6 picture - diff - diff 29 picture - long term New
7 picture - diff - same 30 word - long term Low
8 word - same - same 31 word - long term High
9 word - same - diff 32 word - long term New
10 word - diff - diff 33 picture - correct
11 word - diff - same 34 word - correct
12 picture - long term Low - stim1 35 picture - low confidence - correct
13 picture - long term High - stim1 36 picture - high confidence -correct
14 picture - long term New - stim1 37 picture - new - correct
15 word - long term Low - stim1 38 picture - low confidence - wrong
16 word - long term High - stim1 39 picture - high confidence -wrong
17 word - long term New - stim1 40 picture - new - wrong
18 picture - long term Low - stim2 41 word - low confidence - correct
19 picture - long term High - stim2 42 word - high confidence -correct
20 picture - long term New - stim2 43 word - new - correct
21 word - long term Low - stim2 44 word - low confidence - wrong
22 word - long term High - stim2 45 word - high confidence -wrong
23 word - long term New - stim2 46 word - new - wrong

For each comparison group, we divided the corresponding data samples into 5 non-overlapping subsets. Each time, one subset was held out, and the PSVM classifier was trained on the data samples of the remaining subsets; the samples of the held-out subset were treated as unknown samples to test the performance of the trained classifier. Repeating this procedure for each subset, the averaged prediction accuracy over the 5 folds was used to indicate the degree of separability of the EEG signals of musicians and non-musicians.

To achieve reliable feature selection, we employed an advanced feature selection

technique, called minimum redundancy maximum relevance (mRMR) [64], which

allows us to select a subset of superior features at a low computational cost in a high

dimensional space.

The basic idea of mRMR is to select the most relevant features with respect to

class labels while minimizing redundancy amongst the selected features. The mRMR

algorithm uses mutual information as a distance measure to compute feature-to-

feature and feature-to-class-label non-linear similarities.

For two features X and Y, p(X) and p(Y) are the marginal probability functions, p(X, Y) is the joint probability distribution, and I(X; Y) is the mutual information of X and Y:


 
I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)},          (4.9)

The mRMR method aims to minimize redundancy (Rd) while maximizing relevance (Re) among the features. Rd and Re are defined as follows:

Rd = \frac{1}{|S|^2} \sum_{i, j \in S} I(i, j)          (4.10)

Re = \frac{1}{|S|} \sum_{i \in S} I(h, i)          (4.11)
where S is the set of features, h is the target class label, and I(i, j) is the mutual information between features i and j. The feature selection criterion combining the above two constraints is the mRMR, for which the objective function of feature selection can be defined by

\phi(Re, Rd) = Re - Rd.          (4.12)

An optimal subset of features is one that maximizes the above mRMR objective function.
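A minimal sketch of greedy mRMR selection, using scikit-learn's mutual-information estimators as an assumed stand-in (an implementation choice; the original mRMR code discretizes the features instead):

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

    def mrmr_select(X, y, n_select=10):
        """Greedily pick features maximizing relevance minus mean redundancy."""
        relevance = mutual_info_classif(X, y)            # I(feature; class label)
        selected = [int(np.argmax(relevance))]
        while len(selected) < n_select:
            best_j, best_score = None, -np.inf
            for j in range(X.shape[1]):
                if j in selected:
                    continue
                redundancy = np.mean([mutual_info_regression(X[:, [k]], X[:, j])[0]
                                      for k in selected])
                score = relevance[j] - redundancy        # phi = Re - Rd, eq. (4.12)
                if score > best_score:
                    best_j, best_score = j, score
            selected.append(best_j)
        return selected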

4.3 Result Discussion

Table 4.3 shows the classification accuracy for the 46 conditions and 8 epochs with 5-fold cross-validation and 10 features selected by mRMR, without any ICA artifact removal. The classification accuracies mostly range from 60% to 85%. Some conditions reach 90% or higher, such as condition 21 at epoch A3 and condition 27 at epoch B2. The highest accuracy, 94.59%, occurs at condition 30 and epoch B1.

Table 4.4 shows the classification accuracy for the 46 conditions and 8 epochs with 5-fold cross-validation and 10 features selected by mRMR, with ICA artifact removal. The classification accuracies mostly range from 70% to 85%. Some conditions reach 90% or higher, such as condition 2 at epoch A4, condition 14 at epoch A3, condition 15 at epoch A4, and more. The highest accuracy, 97.30%, occurs at condition 20 and epoch A4. The results are generally better than those obtained directly from the raw data without ICA artifact removal.

To sum up, of the two classification settings we tried, we found that 5-fold cross-validation with 10 mRMR-selected features and ICA artifact removal gives better results, and that epoch A4 generally gives better classification results. Looking at

Table 4.3: The table of the classification accuracy for 46 conditions and 8 epochs
with 5-fold cross validation and 10 features selected by mRMR and without any ICA
artifacts removal.
Epoch Epoch
condition A1 A2 A3 A4 A5 condition B1 B2 B3
1 51.35 78.38 83.78 81.08 70.27 24 59.46 78.38 64.86
2 51.35 70.27 64.86 72.97 67.57 25 86.49 81.08 54.05
3 81.08 70.27 67.57 86.49 83.78 26 67.57 72.97 72.97
4 - - - - - 27 72.97 91.89 70.27
5 - - - - - 28 64.86 81.08 81.08
6 64.86 78.38 62.16 75.68 72.97 29 70.27 89.19 59.46
7 80.00 68.57 88.57 74.29 77.14 30 94.59 86.49 89.19
8 - - - - - 31 67.57 83.78 72.97
9 - - - - - 32 70.27 81.08 72.97
10 75.00 80.56 77.78 86.11 77.78 33 72.97 81.08 56.76
11 - - - - - 34 59.46 78.38 70.27
12 67.57 64.86 78.38 78.38 70.27 35 78.38 67.57 62.16
13 59.46 67.57 75.68 78.38 75.68 36 72.97 83.78 75.68
14 62.16 78.38 70.27 75.68 89.19 37 75.68 86.49 51.35
15 56.76 78.38 75.68 81.08 78.38 38 62.16 70.27 56.76
16 81.08 81.08 83.78 86.49 78.38 39 74.29 57.14 65.71
17 72.97 64.86 75.68 72.97 67.57 40 78.38 81.08 81.08
18 59.46 81.08 62.16 81.08 72.97 41 72.97 72.97 72.97
19 62.16 64.86 72.97 72.97 78.38 42 72.97 75.68 72.97
20 83.78 59.46 51.35 67.57 75.68 43 81.08 81.08 62.16
21 86.49 81.08 91.89 70.27 81.08 44 75.00 63.89 63.89
22 64.86 86.49 64.86 89.19 51.35 45 62.16 70.27 70.27
23 56.76 70.27 78.38 75.68 81.08 46 83.78 72.97 72.97

the selected features of the highest-accuracy setting, we may find the major differences in the EEG signals between musicians and non-musicians under a given condition. For epoch A4 and condition 20, the classifier extensively selected features F1, F8, F14 and F18, which are drawn from the mean, variance, skewness, kurtosis, relative band power, wavelet entropy and wavelet statistics.
entropy and wavelet statistics.

Figure 4.6 compares the EEG signals of the 30 channels for musicians and non-musicians at epoch B1 under condition 30, where the PSVM classifier reaches 97.30% classification accuracy. The plots reveal significant differences between the two groups.

Table 4.4: The table of the classification accuracy for 46 conditions and 8 epochs with
5-fold cross validation and 10 features selected by mRMR and with ICA artifacts
removal
Epoch Epoch
condition A1 A2 A3 A4 A5 condition B1 B2 B3
1 86.49 64.86 72.97 81.08 62.16 24 56.76 67.57 62.16
2 78.38 78.38 75.68 91.89 67.57 25 48.65 78.38 67.57
3 72.97 75.68 78.38 70.27 62.16 26 86.49 64.86 62.16
4 - - - - - 27 59.46 83.78 78.38
5 - - - - - 28 78.38 78.38 70.27
6 83.78 81.08 86.49 81.08 72.97 29 81.08 72.97 56.76
7 71.43 68.57 65.71 82.86 77.14 30 97.30 75.68 64.86
8 - - - - - 31 59.46 81.08 64.86
9 - - - - - 32 81.08 72.97 81.08
10 75.00 69.44 72.22 66.67 50.00 33 72.97 78.38 56.76
11 - - - - - 34 54.05 89.19 81.08
12 75.68 78.38 62.16 72.97 75.68 35 78.38 86.49 72.97
13 64.86 64.86 78.38 89.19 70.27 36 83.78 67.57 64.86
14 81.08 72.97 91.89 62.16 83.78 37 75.68 78.38 70.27
15 89.19 67.57 59.46 91.89 81.08 38 86.49 91.89 54.05
16 78.38 78.38 83.78 70.27 70.27 39 71.43 71.43 68.57
17 81.08 75.68 67.57 72.97 75.68 40 78.38 86.49 62.16
18 75.68 72.97 75.68 78.38 81.08 41 75.68 83.78 70.27
19 75.68 78.38 83.78 83.78 56.76 42 78.38 59.46 54.05
20 72.97 72.97 67.57 97.30 67.57 43 86.49 78.38 75.68
21 78.38 67.57 78.38 75.68 70.27 44 72.22 91.67 72.22
22 62.16 67.57 78.38 81.08 72.97 45 78.38 83.78 67.57
23 81.08 70.27 81.08 59.46 81.08 46 62.16 67.57 67.57

significant differences between the two groups. Figure 4.7 shows that musicians tend to be more activated in the memory test.

4.4 Summary and Future Work

In conclusion, the method satisfactorily predicts the class of the subjects. The highest success rate, 97.30%, occurs at condition 30 and epoch B1.

Because different events may elicit different responses, the sessions are separated, based on the event markers, into several small parts for detailed analysis.

[Figure: 30-channel EEG traces (channels fp1 through o2, arranged by scalp position); amplitude scale ±3.78; time axis 0–246 ms.]

Figure 4.6: Comparison of the EEG signals of 30 channels of musicians (blue line) and non-musicians (red line) at epoch B1 and condition 30.

[Head-plot color scale ±2.3; latency 100 ms.]
(a) The topography of non-musicians at epoch B1 and condition 30 at 100 ms
(b) The topography of musicians at epoch B1 and condition 30 at 100 ms

Figure 4.7: Head plots for musicians and non-musicians at epoch B1 at 100 ms, with ICA-based artifact removal.

Table 4.5: Classification sensitivity and specificity for 46 conditions and 8 epochs with 5-fold cross-validation, 10 features selected by mRMR, and without ICA artifact removal.
Epoch Epoch
A1 A2 A3 A4 A5 B1 B2 B3
cond. sen spec sen spec sen spec sen spec sen spec cond. sen spec sen spec sen spec
1 0.63 0.39 0.63 0.94 0.89 0.78 0.95 0.67 0.84 0.56 24 0.74 0.44 0.84 0.72 0.68 0.61
2 0.58 0.44 0.58 0.83 0.68 0.61 0.79 0.67 0.74 0.61 25 0.79 0.94 0.84 0.78 0.58 0.50
3 0.89 0.72 0.79 0.61 0.63 0.72 0.84 0.89 0.89 0.78 26 0.68 0.67 0.74 0.72 0.79 0.67
4 - - - - - - - - - - 27 0.84 0.61 0.95 0.89 0.84 0.56
5 - - - - - - - - - - 28 0.84 0.44 0.79 0.83 0.79 0.83
6 0.68 0.61 0.74 0.83 0.53 0.72 0.68 0.83 0.63 0.83 29 0.68 0.72 0.95 0.83 0.58 0.61
7 0.89 0.69 0.84 0.50 1.00 0.75 0.74 0.75 0.84 0.69 30 0.89 1.00 0.89 0.83 0.89 0.89
8 - - - - - - - - - - 31 0.63 0.72 0.89 0.78 0.74 0.72
9 - - - - - - - - - - 32 0.79 0.61 0.89 0.72 0.74 0.72
10 0.79 0.71 0.84 0.76 0.84 0.71 0.84 0.88 0.84 0.71 33 0.79 0.67 0.95 0.67 0.32 0.83
11 - - - - - - - - - - 34 0.58 0.61 0.89 0.67 0.74 0.67
12 0.68 0.67 0.63 0.67 0.79 0.78 0.79 0.78 0.79 0.61 35 0.89 0.67 0.79 0.56 0.58 0.67
13 0.47 0.72 0.68 0.67 0.84 0.67 0.79 0.78 0.84 0.67 36 0.68 0.78 0.95 0.72 0.63 0.89
14 0.58 0.67 0.95 0.61 0.74 0.67 0.74 0.78 0.84 0.94 37 0.79 0.72 0.89 0.83 0.53 0.50
15 0.58 0.56 0.74 0.83 0.74 0.78 0.74 0.89 0.74 0.83 38 0.74 0.50 0.58 0.83 0.63 0.50
16 0.84 0.78 0.84 0.78 0.84 0.83 0.79 0.94 0.84 0.72 39 0.58 0.94 0.53 0.63 0.74 0.56
17 0.74 0.72 0.74 0.56 0.79 0.72 0.68 0.78 0.74 0.61 40 0.89 0.67 0.79 0.83 0.89 0.72
18 0.63 0.56 0.79 0.83 0.63 0.61 0.68 0.94 0.68 0.78 41 0.68 0.78 0.74 0.72 0.79 0.67
19 0.74 0.50 0.74 0.56 0.79 0.67 0.74 0.72 0.79 0.78 42 0.84 0.61 0.84 0.67 0.74 0.72
20 0.95 0.72 0.53 0.67 0.58 0.44 0.79 0.56 0.79 0.72 43 0.84 0.78 0.89 0.72 0.63 0.61
21 0.84 0.89 0.95 0.67 1.00 0.83 0.68 0.72 0.84 0.78 44 0.78 0.72 0.61 0.67 0.61 0.67
22 0.63 0.67 0.89 0.83 0.84 0.44 0.89 0.89 0.53 0.50 45 0.79 0.44 0.79 0.61 0.79 0.61
23 0.53 0.61 0.68 0.72 0.89 0.67 0.74 0.78 0.84 0.78 46 0.95 0.72 0.79 0.67 0.79 0.67
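For reference, the sensitivity and specificity reported in Tables 4.5 and 4.6 follow the standard confusion-matrix definitions. A minimal computation is sketched below; coding musicians as the positive class (label 1) is an assumption made purely for illustration.

    import numpy as np

    def sensitivity_specificity(y_true, y_pred):
        """sen = TP / (TP + FN); spec = TN / (TN + FP). Labels in {0, 1}."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        return tp / (tp + fn), tn / (tn + fp)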

There are only two classes, musicians and non-musicians, in this prediction process. Univariate features are extracted from the 30 channels of EEG signals. We have considered signal power features, band power asymmetry, morphological features, statistical features and time-frequency features. Artifact removal based on ICA gives better results than directly using the raw data.
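As a small illustration of the kind of univariate features involved, the sketch below computes a few statistical and band power features for a single channel; the sampling rate and the alpha band edges (8–13 Hz) are illustrative assumptions, and the full feature set used in the study is much larger.

    import numpy as np
    from scipy import signal, stats

    def channel_features(x, fs=256):
        """A few univariate features for one EEG channel."""
        freqs, psd = signal.welch(x, fs=fs)            # power spectral density
        band = (freqs >= 8) & (freqs <= 13)            # alpha band, for example
        return {
            "mean": np.mean(x),
            "variance": np.var(x),
            "skewness": stats.skew(x),
            "kurtosis": stats.kurtosis(x),
            "relative_alpha_power": np.trapz(psd[band], freqs[band])
                                    / np.trapz(psd, freqs),
        }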

In the future, we will consider outlier removal techniques for epochs. Bad data exist in every EEG recording, with many possible causes, such as muscle movement or distraction of the participant during the experiment. The performance is expected to be enhanced by removing the contaminated epochs; a simple example of such a rule is sketched below.
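One simple rule of this kind is an amplitude threshold; in the sketch below, the threshold value and the epoch array layout are illustrative assumptions, and more refined criteria (e.g., statistical outlier tests) could replace the peak-amplitude test.

    import numpy as np

    def reject_epochs(epochs, max_amp=100.0):
        """epochs: (n_epochs, n_channels, n_samples); drop high-amplitude epochs."""
        peak = np.abs(epochs).max(axis=(1, 2))   # peak absolute amplitude per epoch
        keep = peak < max_amp
        return epochs[keep], keep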

Table 4.6: Classification sensitivity and specificity for 46 conditions and 8 epochs with 5-fold cross-validation, 10 features selected by mRMR, and with ICA artifact removal.
Epoch Epoch
A1 A2 A3 A4 A5 B1 B2 B3
cond. sen spec sen spec sen spec sen spec sen spec cond. sen spec sen spec sen spec
1 0.84 0.89 0.63 0.67 0.84 0.61 0.84 0.78 0.53 0.72 24 0.53 0.61 0.68 0.67 0.74 0.50
2 0.74 0.83 0.79 0.78 0.89 0.61 0.95 0.89 0.84 0.50 25 0.58 0.39 0.84 0.72 0.53 0.83
3 0.79 0.67 0.79 0.72 0.79 0.78 0.74 0.67 0.58 0.67 26 0.84 0.89 0.84 0.44 0.74 0.50
4 - - - - - - - - - - 27 0.68 0.50 0.89 0.78 0.84 0.72
5 - - - - - - - - - - 28 0.74 0.83 0.84 0.72 0.79 0.61
6 0.74 0.94 0.84 0.78 0.89 0.83 0.89 0.72 0.68 0.78 29 0.79 0.83 0.84 0.61 0.58 0.56
7 0.68 0.75 0.79 0.56 0.74 0.56 0.84 0.81 0.84 0.69 30 0.95 1.00 0.89 0.61 0.63 0.67
8 - - - - - - - - - - 31 0.84 0.33 1.00 0.61 0.84 0.44
9 - - - - - - - - - - 32 0.84 0.78 0.79 0.67 0.79 0.83
10 0.89 0.59 0.74 0.65 0.84 0.59 0.74 0.59 0.42 0.59 33 0.84 0.61 0.89 0.67 0.79 0.33
11 - - - - - - - - - - 34 0.53 0.80 0.79 1.00 0.79 0.83
12 0.79 0.72 0.95 0.61 0.58 0.67 0.89 0.56 0.79 0.72 35 0.84 0.72 0.84 0.89 0.74 0.72
13 0.63 0.67 0.63 0.67 0.74 0.83 0.95 0.83 0.74 0.67 36 0.89 0.78 0.74 0.61 0.58 0.72
14 0.95 0.67 0.68 0.78 0.95 0.89 0.63 0.61 0.74 0.94 37 0.68 0.83 0.79 0.78 0.74 0.67
15 0.95 0.83 0.68 0.67 0.42 0.78 0.95 0.89 0.89 0.72 38 1.00 0.72 0.95 0.89 0.58 0.50
16 0.84 0.72 0.89 0.67 0.89 0.78 0.84 0.56 0.74 0.67 39 0.63 0.81 0.74 0.69 0.68 0.69
17 0.79 0.83 0.84 0.67 0.89 0.44 0.79 0.67 0.68 0.83 40 0.95 0.61 0.95 0.78 0.68 0.56
18 0.68 0.83 0.74 0.72 0.79 0.72 0.74 0.83 0.79 0.83 41 0.79 0.72 0.84 0.83 0.63 0.78
19 0.63 0.89 0.74 0.83 0.84 0.83 0.89 0.78 0.68 0.44 42 0.84 0.72 0.68 0.50 0.58 0.50
20 0.74 0.72 0.74 0.72 0.68 0.67 0.95 1.00 0.63 0.72 43 0.84 0.89 0.84 0.72 0.63 0.89
21 0.74 0.83 0.74 0.61 0.84 0.72 0.68 0.83 0.58 0.83 44 0.72 0.72 1.00 0.83 0.67 0.78
22 0.84 0.39 0.74 0.61 0.79 0.78 0.84 0.78 0.74 0.72 45 0.74 0.83 0.84 0.83 0.89 0.44
23 0.84 0.78 0.68 0.72 0.84 0.78 0.74 0.44 0.74 0.89 46 0.63 0.61 0.68 0.67 0.84 0.50

CHAPTER 5

Conclusions and Future Research

This dissertation focuses on methodologies for addressing the two problems introduced earlier, which concern prediction in the healthcare and service industries. The problems involve both stationary and non-stationary time series.

Chapter 2 presents the application of ARIMA and the dynamic linear model to stationary time series prediction problems in the healthcare and railroad industries. ARIMA and DLM represent two different ways to explain and model time series.

The dynamic linear model (DLM), a special type of state space method, has been developed as an alternative tool for time series forecasting. However, to apply a DLM, the signal-to-noise ratio R has to be specified. Since the true value of R is generally not available, the only recourse is to guess a value, which is inconvenient and unreliable. To overcome this problem, we propose a method to estimate R automatically within the forecasting procedure. The properties of the proposed R estimator, and of the new forecasting procedure that uses it, are studied by simulation.
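To make the role of R concrete, the sketch below estimates it for the simplest DLM, the local-level model y_t = mu_t + v_t, mu_t = mu_{t-1} + w_t with R = Var(w)/Var(v), by maximizing the Kalman-filter likelihood with the observation variance normalized to one. The local-level form and the maximum-likelihood search are illustrative assumptions; the estimator actually proposed in Chapter 2 may differ.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def neg_loglik(log_R, y):
        """Kalman-filter negative log-likelihood of a local-level DLM (V = 1, W = R)."""
        R = np.exp(log_R)
        m, C, ll = y[0], 1.0, 0.0
        for t in range(1, len(y)):
            a, P = m, C + R              # state prediction
            f, Q = a, P + 1.0            # one-step forecast mean and variance
            e = y[t] - f                 # forecast error
            ll -= 0.5 * (np.log(2 * np.pi * Q) + e ** 2 / Q)
            K = P / Q                    # Kalman gain
            m, C = a + K * e, P - K * P  # filtered state
        return -ll

    def estimate_R(y):
        res = minimize_scalar(lambda lr: neg_loglik(lr, np.asarray(y, float)),
                              bounds=(-10.0, 5.0), method="bounded")
        return float(np.exp(res.x))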

In Chapter 3, we described our proposed pattern-matching-based semi-periodic time series prediction framework and applied it to respiratory motion time series prediction. In radiotherapy, system latencies need to be compensated to achieve accurate irradiation during treatment. Accurate respiratory motion prediction can minimize the damage to normal body tissues and important organs.

Pattern matching can effectively utilize the existing information in the data: similar patterns exhibit similar trends in the response. The pattern recognition process is enhanced by combining it with statistical and feature analysis, which helps to obtain better-matched patterns and to remove undesired ones. The experimental results show that the prediction of the proposed method is very accurate and robust across different kinds of patients. We compared the proposed pattern-based method with the current state-of-the-art methods and found that it outperforms all of them. It should contribute substantially to tumor position prediction and thereby help cancer patients to enhance their quality of life.
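A stripped-down sketch of the core idea, nearest-neighbor matching of the most recent window against the history with the forecast taken as the average continuation of the best matches, is given below. The window length, number of neighbors, Euclidean distance, and plain averaging are illustrative simplifications of the full framework of Chapter 3 (which also uses polynomial pattern approximation, pattern screening, and bootstrapping).

    import numpy as np

    def pattern_predict(series, window=40, horizon=10, k=5):
        """Forecast `horizon` steps ahead from the k best-matching past windows."""
        series = np.asarray(series, dtype=float)
        query = series[-window:]
        # candidate windows whose future of length `horizon` is fully observed
        n = len(series) - window - horizon + 1
        candidates = np.array([series[i:i + window] for i in range(n)])
        dists = np.linalg.norm(candidates - query, axis=1)
        best = np.argsort(dists)[:k]                       # k nearest patterns
        futures = [series[i + window:i + window + horizon] for i in best]
        return np.mean(futures, axis=0)                    # averaged continuation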

Chapter 4 presents a comprehensive study of EEG time series data mining for classifying the EEG signals of musicians and non-musicians. The objective of the study is to predict whether an EEG signal belongs to a musician or a non-musician. The EEG signals are first cleaned by ICA-based artifact removal and outlier epoch rejection. Then, features are extracted from the EEG signals using an extensive set of algorithms. The proximal support vector machine (PSVM) is computationally friendly and efficient, and its performance is usually satisfactory, so we use PSVM as the classifier in our study.
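For concreteness, a minimal sketch of the linear PSVM of Mangasarian and Wild is given below: training reduces to solving a single regularized linear system rather than a quadratic program, which is what makes the classifier so computationally friendly. The regularization parameter nu is an illustrative choice.

    import numpy as np

    def psvm_train(A, d, nu=1.0):
        """Linear proximal SVM. A: (m, n) feature matrix; d: labels in {-1, +1}.
        Solves (I/nu + E'E) z = E'e with E = D[A, -e] and z = [w; gamma]."""
        m, n = A.shape
        e = np.ones(m)
        E = d[:, None] * np.hstack([A, -e[:, None]])
        z = np.linalg.solve(np.eye(n + 1) / nu + E.T @ E, E.T @ e)
        return z[:-1], z[-1]                       # weights w, threshold gamma

    def psvm_predict(A, w, gamma):
        return np.sign(A @ w - gamma)              # classify by sign of A w - gamma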

To sum up, the method satisfactorily predicts the class of the subjects. The highest success rate is 97.30%, which occurs at condition 30 and epoch B1.

Artifact removal is a challenging task: we want to remove noisy signals while retaining the useful information. Much work has been done on this problem, but few approaches give satisfactory results. In our study, we apply ICA to decompose the signal into independent components (ICs) and then remove those considered to be artifact components before reconstructing the signal. The results in Chapter 4 show that our artifact removal significantly enhances the classification performance.
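The sketch below illustrates this decompose, zero out, reconstruct pattern using scikit-learn's FastICA; the choice of FastICA and a caller-supplied list of artifact components are assumptions for illustration (in practice, artifact ICs are identified by visual inspection or automated criteria).

    import numpy as np
    from sklearn.decomposition import FastICA

    def remove_artifact_components(eeg, bad_ics):
        """eeg: (n_samples, n_channels). Zero the listed ICs, then reconstruct."""
        ica = FastICA(n_components=eeg.shape[1], random_state=0)
        sources = ica.fit_transform(eeg)        # decompose into independent components
        sources[:, bad_ics] = 0.0               # remove components judged to be artifacts
        return ica.inverse_transform(sources)   # reconstructed, cleaned signal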

In the future, we will consider outlier removal techniques for epochs. Bad data exist in every EEG recording, with many possible causes, such as muscle movement or distraction of the participant during the experiment. The performance is expected to be enhanced by removing the contaminated epochs.

BIOGRAPHICAL STATEMENT

Jerry K.M. Kam joined the Department of Industrial & Manufacturing System

Engineering at UTA in the Fall of 2010. He received his B.S. degree in Industrial

Engineering & Engineering Management from City University of Hong Kong. He is

co-advised by Prof. Li Zeng and Prof. Shouyi Wang on his PhD study. Currently,

he is working with Prof. Wang on research problems in the field of time series data mining, including respiratory motion time series prediction, time series segmentation, and EEG signal classification.

