
STATIONARY AND NON-STATIONARY TIME SERIES PREDICTION USING

STATE SPACE MODEL AND PATTERN-BASED APPROACH

by

KIN MING KAM

Presented to the Faculty of the Graduate School of

The University of Texas at Arlington in Partial Fulfillment

of the Requirements

for the Degree of

DOCTOR OF PHILOSOPHY

THE UNIVERSITY OF TEXAS AT ARLINGTON

December 2014
Copyright © by KIN MING KAM 2014

All Rights Reserved


To my mother, Kwai Fan Yip, my father, Yat On Kam, and my wife, Ka Yee Mak.
ACKNOWLEDGEMENTS

I would like to thank my advisors, Dr. Li Zeng and Dr. Shouyi Wang, for their tremendous efforts in training me, for constantly motivating, encouraging and challenging me, and for all of their invaluable guidance and support during the course of my PhD study.

I wish to thank Dr. Victoria Chen for her consistent support since the very beginning of my PhD study, for her interest in my research, and for taking the time to serve on my dissertation committee.

I am also grateful to all the teachers who have taught me during my years at UT Arlington, especially Dr. Jay Rosenberger and Dr. Corley. I would also like to thank my classmates, with whom I studied, did projects, and shared so many precious moments. And I would like to thank the administrative staff members in the office, who did excellent work during the course of my study.

Last but not least, I would like to thank my parents and my wife for their love and patience, which allowed me to focus on my study.

November 20, 2014

ABSTRACT

STATIONARY AND NON-STATIONARY TIME SERIES PREDICTION USING

STATE SPACE MODEL AND PATTERN-BASED APPROACH

KIN MING KAM, Ph.D.

The University of Texas at Arlington, 2014

Supervising Professors: Li Zeng, Shouyi Wang

Motion-adaptive radiotherapy techniques are promising because they can deliver ablative radiation doses to a tumor with minimal normal-tissue exposure by accounting for real-time tumor movement. However, a major challenge in the successful application of these techniques is the real-time prediction of target motion to accommodate system delivery latencies. Predicting respiratory motion in real time is challenging, and current respiratory motion prediction approaches are still not satisfactory in terms of accuracy and interpretability. Therefore, we propose a novel respiratory motion prediction approach based on the future values of best-matching patterns. This approach has three major ingredients: (1) construct a real-time accumulated pattern library by orthogonal polynomial approximation using a sliding window; (2) find the k nearest-neighbor patterns in the pattern library and apply a two-step approach to screen out disturbing patterns and identify the final predictive patterns; and (3) make the final prediction using the bootstrapped mean of the future values of the selected predictive patterns for a given prediction horizon. In a study of the respiratory motion traces of 27 lung cancer patients, the proposed prediction approach consistently achieved significantly higher accuracy than current respiratory motion prediction approaches, particularly for long prediction horizons.

There has been much interest in the beneficial effects of musical training on cognition. Previous studies have indicated that musical training is related to better working memory and that these behavioral differences are associated with differences in neural activity in the brain. However, it was not clear whether musical training impacts memory in general, beyond working memory. A comprehensive EEG pattern study has been performed, including various univariate and multivariate features, time-frequency (wavelet) analysis, power-spectra analysis, and deterministic chaos theory. Advanced feature selection approaches have also been employed to select the most discriminative EEG and brain activation features between musicians and non-musicians. High classification accuracy (more than 95%) in memory judgments was achieved using the Proximal Support Vector Machine (PSVM). For working memory, significant differences between musicians and non-musicians were found during the delay period. For long-term memory, significant differences in EEG patterns between groups were found in both the pre-stimulus and post-stimulus periods on recognition. These results indicate that musicians' memory advantage occurs in both working memory and long-term memory, and that the developed computational framework using advanced data mining techniques can be successfully applied to classify complex human cognition with high time resolution.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF ILLUSTRATIONS
LIST OF TABLES

Chapter

1. Introduction
   1.1 Motivation
   1.2 Research Objectives and Challenges
      1.2.1 Demand Forecasting in Service Industries
      1.2.2 Problem and Challenges
      1.2.3 Pattern-Based Online Prediction of Semi-periodic and Nonstationary Time Series
   1.3 Outline of the Dissertation
2. ARIMA and Dynamic Linear Model for Time Series Forecasting
   2.1 Literature Review
   2.2 ARIMA and Dynamic Linear Model
      2.2.1 ARIMA
      2.2.2 Dynamic Linear Models
   2.3 The Proposed Method
   2.4 Numerical Study
      2.4.1 Scenario Design and Computation Procedure
      2.4.2 Concerns To Address
      2.4.3 Results
      2.4.4 Summary of Numerical Study
   2.5 Case Study
   2.6 Summary And Future Work
3. Pattern-Based Real-Time Prediction Of Semi-Periodic And Nonstationary Time Series
   3.1 Introduction
   3.2 Related Work
      3.2.1 Time-Varying Seasonal Autoregression (TVSAR)
      3.2.2 Wavelet-Based Multiscale Autoregression
   3.3 Pattern-Based Variant Best-Neighbors Prediction Using Raw Data
      3.3.1 Personalized Pattern Monitoring Window Size
      3.3.2 Variant Best-Neighbors-Based Predictive Pattern Selection
      3.3.3 Online Prediction Frameworks Using the Selected Predictive Patterns
      3.3.4 Comparison of the Prediction Performance of RPKM and Some State-Of-The-Art Methods
   3.4 Pattern-Based Variant-Best-Neighbors Prediction Using Orthogonal Polynomials Approximated Respiratory Motion Time Series
      3.4.1 Orthogonal Polynomials Approximation
      3.4.2 Prediction Results of RPKM and OPPRED
      3.4.3 Weighted Orthogonal Polynomials Approximations
      3.4.4 Weighted Time Series Pattern Matching
   3.5 Discussion and Conclusion
      3.5.1 Future Studies
4. Pattern Recognition and Classification of Multivariate Time Series Signals: EEG Study of Musicians and Non-Musicians
   4.1 Introduction
   4.2 Methodology
      4.2.1 Data Acquisition and Experimental Settings
      4.2.2 Artifacts Removal
      4.2.3 Signal Feature Extraction
      4.2.4 Feature Vector Classification Using Proximal Support Vector Machine (PSVM)
   4.3 Result Discussion
   4.4 Summary and Future Work
5. Conclusions and Future Research

REFERENCES

BIOGRAPHICAL STATEMENT
LIST OF ILLUSTRATIONS

1.1 Examples of complex seasonality showing (a) non-integer seasonal periods, (b) multiple nested seasonal periods, and (c) multiple non-nested and non-integer seasonal periods
1.2 An example of low-count time series which are sample inventory demands
1.3 A computer-simulated lung
1.4 Some examples of respiratory motion time series
1.5 Outline of the dissertation
2.1 Categorization of quantitative forecasting models
2.2 Examples of stationary and homogeneous nonstationary time series
2.3 Illustration of 1-step forecasting
2.4 Illustration of k-step forecasting
2.5 Structure of the dynamic linear model
2.6 The proposed forecasting method with R estimates
2.7 An illustration of the definition of Interval and a histogram of the interval
2.8 Histogram of MSE of V
2.9 Histogram of r/R of V
2.10 Histogram of MSE of nf
2.11 Histogram of r/R of nf
2.12 Histogram of MSE of R
2.13 Histogram of r/R of R
2.14 Design of experiment and the prediction performance (mean of MSE and r/R of each set of experiments)
2.15 Performance of the proposed procedure for forecasting
2.16 Performance of forecasting with Updating-R vs. Fixed-R
2.17 The data used in the case study
2.18 Forecasting results of the six datasets
2.19 ACF and PACF plots of original data in case 1
2.20 ACF and PACF plots of deseasonalized data in case 1
2.21 ACF and PACF plots of model (0, 0, 0)(1, 1, 1)7 for case 1
2.22 ACF and PACF plots of model (1, 0, 1)(1, 1, 1)7 for case 1
2.23 MSE of the two methods, DLM (dark blue) vs. ARIMA, for the 6 cases, from left to right and then top to bottom
2.24 R estimates of the 6 cases, from left to right and then top to bottom. Except for case 5, the R estimate approaches a stable value as more observations arrive
2.25 Updating-R vs. Fixed-R (R = [0.001 0.01 0.02 0.05 0.08])
3.1 Estimation procedure for reference intervals
3.2 Wavelet decomposition of a respiratory motion time series
3.3 An example of an order-3 AR model built by 2-level wavelet scales
3.4 The general approach of the proposed pattern-based Variant-Best-Neighbors prediction using raw data
3.5 Three best neighbors (solid black lines) of the current segment (solid blue line); the dotted lines are their "future" values
3.6 Scatter plots (left) and autocorrelation function (right) of the height and the interval of respiratory motion versus its 1st lag
3.7 An illustration of the definition of Interval and a histogram of the interval
3.8 The prediction accuracy for various window lengths with window ratio to median interval (R) ranging from 0.3 to 1.5 for prediction horizons h = 1, 5, 10, 15, 20, 25, 30 for patient 16
3.9 A 3-D plot of the prediction accuracy for various window lengths with window ratio to median interval (R) ranging from 0.3 to 1.5 for prediction horizons h = 1, 5, 10, 15, 20, 25, 30 for patient 16
3.10 A flow chart of the VBN procedure: Phase I
3.11 Online prediction of a patient's respiratory data using unaligned BNs (left) and right-aligned BNs (right). Below are the best neighbors marked with vertical lines in the time series
3.12 A zoom-in view of Figure 3.11, using unaligned BNs (left) and right-aligned BNs (right). Right-aligned BNs are clearly better than unaligned BNs
3.13 Scatter plots of the error before tλ vs. the error after tλ. Correlation between the errors is observed
3.14 An illustration of the error of the best neighbors before and after time tλk. If the error on the left-hand side is large, then the error on the right-hand side is also likely to be large
3.15 A real example of the error of the best neighbors before and after time tλk
3.16 An example of an outlier in the best neighbors
3.17 Another example of best neighbors without any outliers
3.18 Kolmogorov-Smirnov test during prediction of the respiratory motion of patient 2 with prediction horizon h = 15
3.19 Illustration of support vector regression with insensitivity parameter ε and slack variable ξ
3.20 Prediction performance of RPKM, RPKS and the state-of-the-art methods for prediction horizons h = 20 to h = 30
3.21 Prediction performance of RPKM and RPKM (without adaptive ratio) for prediction horizons h = 1 to h = 30
3.22 The general approach of the proposed pattern-based Variant-Best-Neighbors prediction using orthogonal polynomials approximated respiratory motion time series
3.23 Legendre polynomials
3.24 An example of OP approximation in which the lower-order approximation (order 18) is better than the higher-order one (order 20)
3.25 Prediction performance of RPKM, OPPRED, TVSAR, wLMS, SVRpred and SARIMA
3.26 Prediction results of Patient 9 with h = 15
3.27 Prediction results of Patient 2 with h = 15
3.28 An example showing that even when two time series have the same amount of error, the occurrences of the errors can be very different. The upper plot shows two patterns that match very well in the older data (left) but not in the newest data; for prediction, the lower one is preferable
3.29 The weights of the shorter window (black dotted) and the longer window (red dotted)
3.30 Prediction performance of RPKM and OPPRED on noise-added data
3.31 An example of a noise-added time series in the simulation study. Simulated noise is added to the respiratory time series data of a patient
4.1 Schematic of the experimental paradigm. (A1 to A5) During the study period, participants were asked to judge whether the second stimulus matched the first. (B1 to B3) During the test period, participants made memory judgments to stimuli while rating their confidence. Low represents remember with low confidence, High represents remember with high confidence, and New represents a judgment where participants thought the stimulus was not studied
4.2 The map of the channel locations
4.3 Artifact removal using ICA
4.4 Topographies for ICA-based artifact removal
4.5 Groups of channels for inter- and intra-hemispheric power band asymmetry. For inter-hemispheric power band asymmetry, the value is calculated by pairs of the same colors over the other hemisphere. For intra-hemispheric power band asymmetry, the value is calculated by pairs of different colors within the same hemisphere
4.6 Comparison of the EEG signals of 30 channels of musicians (blue line) and non-musicians (red line) at epoch B1 and condition 30
4.7 Head plot for musicians and non-musicians at epoch B1 at 100 sec with ICA-based artifact removal
LIST OF TABLES

3.1 A list of latencies of different systems
3.2 The prediction performance metrics (mean and standard deviation of R-squares) of the proposed methods and the state-of-the-art respiratory motion prediction methods on 27 patients
3.3 The prediction performance metrics (mean and standard deviation of R-squares) of the proposed approaches with and without adaptive ratio on 27 patients
3.4 The coefficients of orthogonal polynomials up to order 20
3.5 The prediction performance of RPKM and OPPRED on 27 patients
3.6 The prediction performance metrics (mean and standard deviation of R-squares) of the proposed approaches on the noise-added respiratory motion time series of the first 4 patients
4.1 Frequency ranges and the corresponding brain signal frequency bands of the five levels of signals by discrete wavelet decomposition
4.2 A list of all comparison conditions of the experiments. For comparison conditions 4 to 11, the naming structure is stimulus/ground truth/response. For conditions 12 to 23, the naming structure is stimulus/response to that stimulus in the test session/whether it was the 1st or 2nd stimulus. For conditions 35 to 46, it is stimulus/confidence level of having seen the stimulus/correctness
4.3 Classification accuracy for 46 conditions and 8 epochs with 5-fold cross validation and 10 features selected by mRMR, without any ICA artifact removal
4.4 Classification accuracy for 46 conditions and 8 epochs with 5-fold cross validation and 10 features selected by mRMR, with ICA artifact removal
4.5 Classification sensitivity and specificity for 46 conditions and 8 epochs with 5-fold cross validation and 10 features selected by mRMR, with ICA artifact removal
4.6 Classification sensitivity and specificity for 46 conditions and 8 epochs with 5-fold cross validation and 10 features selected by mRMR, with ICA artifact removal
CHAPTER 1

Introduction

1.1 Motivation

In 1960, Muth pioneered simple exponential smoothing (SES) and developed a useful classification of trend and seasonal patterns depending on whether they are additive or multiplicative [1]. Following the work of Box and Jenkins, some linear exponential smoothing forecasts were shown to be special cases of the ARIMA model. In 1985, Snyder proposed a class of innovation state space models and paved the way for the development of forecasting models for nonlinear exponential smoothing methods that can be derived statistically. Over the years, many time series problems have been identified and various methods have been developed to overcome the challenges in time series analysis.

Time series analysis has attracted much attention in the past three decades. According to the Google Scholar search engine, up to 1990 there were only 67,000 publications containing the keywords "time series", while by 2000 the number had risen to 453,000. Currently, 3,110,000 publications can be found, in contrast to the 8,680,000 publications containing the keyword "data". Although this is only a rough text-mining exercise, it still reveals a tremendous volume of research in the context of time series. According to Google Scholar, there are 1,660,000 results on the keywords "time series" and "prediction", constituting over half of the total research volume in time series analysis. Time series prediction is thus a very popular and challenging problem.

Time series analysis comprises methods for analyzing time series data in order to extract meaningful information from the data, and it includes several major areas of study: indexing, clustering, classification, prediction, summarization, anomaly detection and segmentation [2]. These resemble the major data mining areas, which are very popular in scientific research and in industry. These methods are very often used together. For instance, Rubio proposed a weighted least squares support vector machine for time series prediction which combines the use of prediction and classification [3]. In our study of respiratory motion time series prediction, we apply pattern matching, regression and anomaly detection.

Time series prediction is the task of using a model to predict future values based on historical data. In service industries, the ability to accurately estimate demand is very important for better marketing and cost saving. Preez [4] investigated tourism demand from four European countries to the Seychelles using time series prediction. In healthcare, time series prediction is applied to radiotherapy to give patients a better quality of life [5, 6], and hospital management applies it to predict the demand of nurse triage centers [7]. In the transportation industry, companies forecast the demand for cargo and passenger transportation.

Time series prediction can be classified into two categories: stationary time series prediction and non-stationary time series prediction. ARIMA and many stochastic models, such as dynamic linear models, perform well on stationary data. ARIMA is one of the most popular methods in time series prediction because of its generality [8]: autoregressive, moving average and exponential smoothing models are special cases of the ARIMA framework [1]. However, ARIMA has its own limitations, and these methods generally do not perform well on non-stationary data. Therefore, new methods are developed to cope with various situations.

The advancement of information technologies in healthcare and other service industries provides tremendous amounts of data and, in turn, many research opportunities. Large amounts of data are available in service industries such as hospitals, transportation companies and travel agencies.

This dissertation focuses on demand forecasting for service industries and respiratory motion time series prediction. These two problems cover both stationary and semi-periodic time series, so the solutions provided have great potential to be applied to a broad range of problems in practice. Specifically, we consider respiratory motion time series prediction, which predicts the tumor position during radiotherapy, as well as the number of calls received at a nurse triage center and the load history of cargo in railroad service. The research seeks to explore fundamental methodologies that improve the currently available methods and conquer the difficulties in the prediction of time-varying time series data.

1.2 Research Objectives and Challenges

1.2.1 Demand Forecasting in Service Industries

Background. Preez [4] points out that accurate forecasts of tourism demand are essential for efficient planning by the various sectors of the tourism industry, and forecast accuracy is particularly important in the tourism context because the tourism product is perishable; e.g., unused plane seats, hotel rooms and hire cars cannot be stockpiled. Specifically, short-term forecasts can aid decision making in areas such as scheduling, staffing and planning tour operator brochures.

According to Peck [9], emergency department (ED) crowding in hospitals is a major problem nationally and occurs when there is a mismatch between the demand and supply of the resources needed to evaluate, treat, and discharge patients from the ED. In current practice, bed requests and preparation to receive the patient are often delayed until admission is certain. Since the ED is usually very crowded, leaving facilities unutilized is undesirable. Therefore, Peck investigated forecasting methods to predict the demand.

Babcock studied a forecasting problem on cargo grain for the railroad industry [10]. Grain shippers need the forecasts to evaluate transportation equipment needs, establish marketing plans, and formulate strategies for negotiating prices and service with railroads. Port authorities need forecasts of rail grain transportation for port utilization monitoring and port expansion planning.

1.2.2 Problem and Challenges

Time series prediction relies on time series analysis, which decomposes the time series into its constituent properties and quantifies each of them. These properties include seasonality, non-periodic cycles, trend and irregular components [11].

To handle the trend of a time series, the traditional method is to study the autocorrelation and the partial autocorrelation and to remove the non-stationary trend in order to make the series stationary. Homogeneous non-stationary time series can be fixed by differencing, while other kinds of complex non-stationarity may require new methods.

The irregular component describes random and irregular influences in the time series. This component may be decomposed and described using statistical analysis. For instance, in dynamic linear models, the observation and the hidden mean are assumed to follow certain distributions.

In service industries, many time series have strong periodicity. The seasonality issue can usually be solved satisfactorily by de-seasonalization, after which the time series becomes stationary and many classical methods can be applied to it. Complex seasonal time series may exhibit multiple seasonal periods, high-frequency seasonality, non-integer seasonality, and dual-calendar effects, as shown in Figure 1.1 [12].


Figure 1.1: Examples of complex seasonality showing (a) non-integer seasonal periods,
(b) multiple nested seasonal periods, and (c) multiple non-nested and non-integer
seasonal periods

5
The seasonality can be found either by using the fast Fourier transform to analyze the frequency components or by inspecting the autocorrelation function (ACF) plot to find the lags with significant correlation.
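As an illustration, both routes can be automated. The following Python sketch (not part of the dissertation's implementation; the helper name and defaults are hypothetical) estimates a candidate period from the FFT power spectrum and from the largest ACF value at a positive lag, assuming a uniformly sampled series y:

import numpy as np

def dominant_period(y, max_lag=60):
    # Estimate a seasonal period two ways: FFT peak and ACF peak.
    y = np.asarray(y, dtype=float) - np.mean(y)
    # Frequency domain: the strongest nonzero frequency component.
    spec = np.abs(np.fft.rfft(y)) ** 2
    freqs = np.fft.rfftfreq(len(y), d=1.0)
    f_star = freqs[1:][np.argmax(spec[1:])]          # skip the DC term
    period_fft = 1.0 / f_star
    # Time domain: the positive lag with the largest sample autocorrelation.
    acf = np.correlate(y, y, mode="full")[len(y) - 1:]
    acf = acf / acf[0]
    period_acf = int(np.argmax(acf[1:max_lag + 1])) + 1
    return period_fft, period_acf

For a series with a single dominant seasonal period, the two estimates should roughly agree.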

In service industries, another challenge of time series prediction is the low-count time series pattern [13]. In low-count time series, the counts in any given period are sufficiently small that it may be unrealistic to forecast them with conventional models, including ARIMA, that are based on the normal distribution. Yelland proposed the use of dynamic linear models (DLMs) to solve this type of problem. Figure 1.2 shows an example of a low-count time series.

Figure 1.2: An example of low-count time series which are sample inventory demands

One of the challenges of the DLM is determining the signal-to-noise ratio of the time series, which may require prior experience with the data or a separate analysis. This imposes inconvenience on practitioners.

This research presents a development of the dynamic linear model with applications to demand forecasting in service industries. In light of the above discussion, the proposed method provides a framework that makes the dynamic linear model ready for prediction as soon as data are available.
1.2.3 Pattern-Based Online Prediction of Semi-periodic and Nonstationary Time Series

Background. In radiation therapy, it is important to deliver a sufficient dose to the tumor while reducing the damage to normal body tissues. To achieve this, the respiratory motion during radiotherapy has to be accounted for. Currently, there are several methods to account for respiratory motion [5]:

1. Motion-encompassing methods

2. Respiratory gating methods

3. Breath-hold methods

4. Forced shallow breathing with abdominal compression

5. Real-time tumor-tracking methods

Motion-encompassing methods estimate the mean position and range of motion during CT imaging. Respiratory gating involves the administration of radiation (during both imaging and treatment delivery) within a particular portion of the patient's breathing cycle, commonly referred to as the gate. Breath-hold methods control the tumor position for radiotherapy; for breast cancer, during inhalation the diaphragm pulls the heart away from the breast, so there is a potential reduction of both cardiac and lung toxicity. Forced shallow breathing with abdominal compression applies pressure to the abdomen to reduce diaphragmatic excursions while still permitting limited normal respiration. Real-time tumor tracking can in principle be achieved by using an MLC or a linear accelerator attached to a robotic arm or, alternatively, by aligning the tumor to the beam via couch motion [5]. To succeed, real-time tumor-tracking methods should be able to do four things: (1) identify the tumor position in real time; (2) anticipate the tumor motion to allow for time delays in the response of the beam-positioning system; (3) reposition the beam; and (4) adapt the dosimetry to allow for changing lung volume and critical structure locations during the breathing cycle [5]. In this dissertation, the prediction of tumor position is studied. One way to predict the position is through the prediction of respiratory motion, which is a semi-periodic time series. In the example of a tumor located at the superior segment of the right lung (circled in Figure 1.3), respiration is the dominant source of the tumor motion, but other sources such as cardiac motion may also be present in the time series [14].

Figure 1.3: A computer-simulated lung

The method proposed in this dissertation is designed for any time series that shows the characteristics of a semi-periodic time series. Other popular examples are ATM cash demands and geophysical data, such as sea level, sea temperature and seismic activity.

Problem and Challenges. Semi-periodic (also called quasi-periodic or quasiharmonic) time series refer to signals that are virtually periodic yet demonstrate both microscopic and macroscopic variations. Semi-periodic time series are characterized by drifts in mean position, frequency and phase, and these drifts can be considered to occur at random. Figure 1.4 shows the respiratory motion time series of lung tumor patients. The respiratory motion patterns of patients demonstrate high individuality, which is one of the challenges: to develop an application for tumor position prediction, the method must be robust enough to give satisfactory results for all patients.

Figure 1.4: Some examples of respiratory motion time series

Another challenge is to make full use of the historical data. Most current state-of-the-art prediction methods only consider local trends and are unable to take the whole time series into account [15, 16, 17], discarding a lot of important information.

Therefore, in this dissertation, a pattern-based online prediction method is proposed that uses pattern recognition techniques to conquer the issue of individuality and to make full use of all the available respiratory records. The prediction method searches for similar patterns in the history and then uses the information of these best-matching patterns for prediction. There are two major challenges: 1) to find the best neighbors which are the most relevant to the prediction problem, and 2) to maximally use the information of the obtained best neighbors.

In this study, we propose a weighted pattern-based variant best-neighbors prediction method using orthogonal polynomial approximations. This approach is able to deliver satisfactory prediction results for semi-periodic time series; a minimal sketch of the underlying idea follows.
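To make the idea concrete, the following Python sketch implements the simplest form of best-neighbors prediction: match the most recent window against all earlier segments and average the values that followed the k closest matches. This is an illustrative skeleton only; the method developed in Chapter 3 adds orthogonal polynomial approximation, pattern screening and bootstrapping on top of it, and the function name and parameter values here are hypothetical.

import numpy as np

def knn_pattern_forecast(series, window=40, k=5, horizon=15):
    # Find the k historical segments closest (Euclidean distance) to the
    # most recent window, then average the values observed `horizon`
    # steps after each of them.
    y = np.asarray(series, dtype=float)
    query = y[-window:]
    # Candidate segments must leave room for a horizon-step "future".
    n_candidates = len(y) - window - horizon
    dists, futures = [], []
    for s in range(n_candidates):
        seg = y[s:s + window]
        dists.append(np.linalg.norm(seg - query))
        futures.append(y[s + window + horizon - 1])
    best = np.argsort(dists)[:k]
    return float(np.mean(np.asarray(futures)[best]))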

1.3 Outline of the Dissertation

This dissertation focuses on methodologies for addressing the two problems introduced in Section 1.2, which solve prediction problems in healthcare and service industries. The problems involve both stationary and nonstationary time series. Figure 1.5 shows an outline of the dissertation.

Chapter 2 presents the details of the application of ARIMA and the dynamic linear model to stationary time series prediction problems for the healthcare and railroad industries. ARIMA and DLM represent two different ways to explain and model time series. The mechanisms of how they work and the limitations of the algorithms are discussed in the chapter.

Chapter 3 is dedicated to a prediction problem on semi-periodic time series. Respiratory motion time series, one kind of semi-periodic time series, is selected for study in this chapter. Variant k-Best-Neighbors is used as the core method for time series pattern matching. At the end of the chapter, the proposed method is compared to three state-of-the-art methods in respiratory motion prediction as well as to seasonal ARIMA.

Figure 1.5: Outline of the dissertation

CHAPTER 2

ARIMA and Dynamic Linear Model for Time Series Forecasting

2.1 Literature Review

Due to the advance of information technology, there are more and more ways to collect time series data. For example, consumer devices such as mobile phones and laptop computers collect data and upload them to the Internet; sensors such as GPS and RFID record positions with time stamps; machines such as glass-making machines record quality measurements of products; and medical equipment such as EEG and ECG records vital signals of patients. Time series data grow not only in length but also in breadth, so that more and more big data become available. The increasing availability of time series data empowers us to obtain more knowledge via techniques of statistics and data mining.

Tasks of time series data mining include indexing, clustering, classification, forecasting and anomaly detection [18]. Indexing assigns indices to a query time series to represent its similarity to a class; prediction and other analyses can then be done using this similarity information. Clustering separates time series into groups based on available independent variables such that, within each group, the time series show similar properties. Classification assigns time series to predefined classes. Forecasting models the underlying system and predicts future values. Anomaly detection finds abnormality in a time series by comparing it to a benchmark of normal time series. This study focuses on the forecasting problem of time series data, which is a critical concern in many applications, such as weather forecasting, natural disaster forewarning, and the prediction of epidemics and stock crashes [18].
Traditional time series data usually have relatively low dimensionality. As data become massive in volume, traditional statistics and data mining techniques are no longer able to cope. Also, due to the high non-stationarity and large amounts of noise that may be present in available time series data, traditional time series analysis tools such as ARIMA, which assume stationarity, may no longer be suitable [18]. So it is necessary to find a way to overcome the limitations of the traditional approaches and uncover complex, hidden patterns in massive non-stationary time series data.

Forecasting can be done by empirical qualitative analysis or by mathematical quantitative analysis; from this perspective, forecasting methods can be broadly classified as qualitative and quantitative. Empirical qualitative analysis, based on expertise, experience and intuition, is useful when historical data are unavailable or irrelevant due to rapid changes in circumstances [18]. Quantitative methods can be further classified into causal and non-causal methods [1, 18]. Causal methods include linear regression, econometric models and artificial neural network (ANN) models, where predictions are made based on data of relevant influential factors. Non-causal methods include moving average [19, 20, 1, 21, 18], exponential smoothing [1, 18], Box-Jenkins [19, 20, 1, 21, 18], state space [1, 18, 22] and spectral analysis [1, 18]. More details of quantitative methods are given in Figure 2.1.

Quantitative methods usually analyze certain characteristics of the time series for prediction, e.g., trend, seasonality, cycles and randomness. The trend tells us whether the series is increasing or decreasing and whether it is linear or non-linear. The seasonality tells us whether a pattern repeats at a fixed time interval. Cycles are very common in time series data; their patterns may repeat at varying time intervals. Randomness makes patterns difficult to identify, and it is desirable to separate the randomness from the systematic patterns [18].
Figure 2.1: Categorization of quantitative forecasting models

To evaluate the performance of forecasting models, mean squared error (MSE) and its variants, such as root mean squared error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE), are commonly used. Another popular performance measure is R-square, which represents the proportion of variability in a data set that can be explained by the forecasting model. For model selection problems, the Akaike information criterion (AIC) is often used, which penalizes model complexity to discourage overfitting. However, AIC is not consistent. The Bayesian information criterion (BIC) and the Hannan-Quinn criterion (HQC) are popular alternatives to AIC and are consistent; BIC also generally penalizes free parameters more strongly than AIC does. Cross validation is another model selection method: data are divided into two sets, one for training and the other for validation, and, to account for the variation in the data, many different training and validation sets are used. Besides best subset selection, stepwise model selection, such as forward selection and backward selection, is often used to find the best model.
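For concreteness, these accuracy measures can be computed as in the following Python sketch (an illustrative helper, not part of any method proposed here):

import numpy as np

def forecast_metrics(y, f):
    # Common accuracy measures for forecasts f of observations y.
    y, f = np.asarray(y, float), np.asarray(f, float)
    e = y - f
    mse = np.mean(e ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(e)),
        "MAPE": 100.0 * np.mean(np.abs(e / y)),   # assumes y has no zeros
        "R2": 1.0 - np.sum(e ** 2) / np.sum((y - y.mean()) ** 2),
    }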

2.2 ARIMA and Dynamic Linear Model

ARIMA methods are the most popular tools for time series forecasting and have been applied in many different applications, such as tourism forecasting [8, 6], where seasonal ARIMA is applied to determine the size of the flows of tourism demand in Montenegro. Recently, dynamic linear model (DLM) forecasting methods have been developed, which are shown to be advantageous in some cases, especially short-term forecasting [6, 23]. In this section, the basics and forecasting procedures of these two methods are first reviewed, and then the issues in using them in practice are discussed.

2.2.1 ARIMA

Basics Of ARIMA. ARIMA stands for Autoregressive Integrated Moving Average; it models a time series through these three components. Hence, autoregressive models, moving average models (e.g., SES, EWMA) and random-walk models with or without trend are special cases of ARIMA models. For an autoregressive model of order p (i.e., AR(p)), the current value depends on previous values plus the current error term,

$$z_t = \phi_1 z_{t-1} + \cdots + \phi_p z_{t-p} + a_t \qquad (2.1)$$

where $z_t$ is the observation at time $t$, $\phi_i$ is the regression coefficient of the observation at lag $i$, and $a_t$ is the error term at time $t$. The backshift operator $B$ is introduced to simplify the formula, where

$$z_{t-p} = B^p z_t$$

Therefore, equation 2.1 can be rewritten as

$$(1 - \phi_1 B - \cdots - \phi_p B^p)\, z_t = a_t \qquad (2.2)$$

or

$$\phi(B)\, z_t = a_t \qquad (2.3)$$

For a moving average model of order q (i.e., MA(q)), the current value depends on the current and previous error terms,

$$z_t - \mu = a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q} \qquad (2.4)$$

where $\theta_j$ is the regression coefficient of the error term at lag $j$. Using the backshift operator notation, this can be expressed as

$$z_t - \mu = \theta(B)\, a_t \qquad (2.5)$$

Figure 2.2: Examples of stationary and homogeneous nonstationary time series

Consequently, the ARMA(p, q) process can be written as

$$\phi(B)(z_t - \mu) = \theta(B)\, a_t \qquad (2.6)$$

This model assumes that the underlying process is stationary, which means that the mean and variance are constant and the autocovariances depend only on the time lag. Figure 2.2a shows a typical example of a stationary time series. In contrast, Figure 2.2b shows an example of a homogeneous nonstationary time series, which resembles a random walk. The definition of stationarity can be expressed as follows:

• $E(y_t) = \mu$ for all $t$
• $\mathrm{Var}(y_t) = E[(y_t - \mu)^2] = \sigma^2$ for all $t$
• $\mathrm{Cov}(y_t, y_{t-k}) = \gamma_k$ for all $t$, depending only on the lag $k$

Box and Jenkins [20] point out that homogeneous nonstationary sequences like the data in Figure 2.2b can be transformed into stationary sequences by taking successive differences of the series [19].
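A quick numerical illustration of this point in Python: a random walk, the canonical homogeneous nonstationary series, becomes stationary after a single difference (the seed and length below are arbitrary).

import numpy as np

rng = np.random.default_rng(0)
walk = np.cumsum(rng.normal(size=500))   # random walk: homogeneous nonstationary
diffed = np.diff(walk)                   # first difference recovers stationary steps

# The level of the walk drifts between halves; the differenced series does not.
print(walk[:250].mean(), walk[250:].mean())      # typically far apart
print(diffed[:250].mean(), diffed[250:].mean())  # both near zero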

Similar to many earlier methods, such as the Holt-Winters method of exponential smoothing, ARIMA is able to model the seasonality of a time series. The multiplicative seasonal autoregressive integrated moving average model of order $(p, d, q)(P, D, Q)_s$ is

$$\phi(B)\,\Phi(B^s)\,(1 - B)^d (1 - B^s)^D z_t = \theta(B)\,\Theta(B^s)\, a_t \qquad (2.7)$$

where $B^s$ is the seasonal backshift operator.
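In software, the multiplicative seasonal model maps directly onto the order and seasonal-order arguments of standard ARIMA routines. The following sketch, which assumes the statsmodels package and an artificial series with period s = 7, is illustrative only:

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Toy weekly-seasonal series; order=(p, d, q), seasonal_order=(P, D, Q, s).
rng = np.random.default_rng(1)
t = np.arange(300)
y = 10 * np.sin(2 * np.pi * t / 7) + rng.normal(scale=2.0, size=t.size)

model = SARIMAX(y, order=(1, 0, 1), seasonal_order=(1, 1, 1, 7))
res = model.fit(disp=False)
print(res.forecast(steps=7))   # forecasts for the next seasonal cycle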

ARIMA models assume that the terms in the time series have linear relationships and that the residuals follow a normal or t distribution with constant mean and variance. Box and Jenkins [20] suggest a three-stage model building approach:

• Model Specification

• Model Estimation

• Diagnostic Checking

MODEL SPECIFICATION

The following rules are typically used in building ARIMA models [21]:

• Differencing (I)

– Rule 1: If the series has positive autocorrelations to a high number of lags,

then it probably needs a higher order of differencing.

– Rule 2: If the lag-1 autocorrelation is zero or negative, or the autocorrela-

tions are all small and patternless, then the series does not need a higher

order of differencing. If the lag-1 autocorrelation is -0.5 or more negative,

the series may be overdifferenced.

– Rule 3: The optimal order of differencing is often the order of differencing

at which the standard deviation is lowest.

– Rule 4: A model with no orders of differencing assumes that the original

series is stationary (mean-reverting). A model with one order of differ-

encing assumes that the original series has a constant average trend (e.g.

a random walk or SES-type model, with or without growth). A model

with two orders of total differencing assumes that the original series has a

time-varying trend (e.g. a random trend or LES-type model).

– Rule 5: A model with no orders of differencing normally includes a constant

term (which represents the mean of the series). A model with two orders

of total differencing normally does not include a constant term. In a model

with one order of total differencing, a constant term should be included if

the series has a non-zero average trend.

• AutoRegressive (AR)

– Rule 6: If the PACF (partial autocorrelation function) of the differenced series displays a sharp cutoff and/or the lag-1 autocorrelation is positive, i.e., if the series appears slightly "underdifferenced", then consider adding an AR term to the model. The lag at which the PACF cuts off is the indicated number of AR terms.

• Moving Average (MA)

– Rule 7: If the ACF (autocorrelation function) of the differenced series displays a sharp cutoff and/or the lag-1 autocorrelation is negative, i.e., if the series appears slightly "overdifferenced", then consider adding an MA term to the model. The lag at which the ACF cuts off is the indicated number of MA terms.

• AR and MA

– Rule 8: It is possible for an AR term and an MA term to cancel each other's effects, so if a mixed AR-MA model seems to fit the data, also try a model with one fewer AR term and one fewer MA term, particularly if the parameter estimates in the original model require more than 10 iterations to converge.

• Unit Root

– Rule 9: If there is a unit root in the AR part of the model, i.e., if the sum of the AR coefficients is almost exactly 1, you should reduce the number of AR terms by one and increase the order of differencing by one.

– Rule 10: If there is a unit root in the MA part of the model, i.e., if the sum of the MA coefficients is almost exactly 1, you should reduce the number of MA terms by one and reduce the order of differencing by one.

– Rule 11: If the long-term forecasts appear erratic or unstable, there may be a unit root in the AR or MA coefficients.
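These hand rules can be complemented by a mechanical search over candidate orders scored by AIC (Section 2.1). The following Python sketch, assuming statsmodels and a hypothetical helper name, illustrates the idea; it complements rather than replaces the diagnostic checking described below.

import itertools
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def select_order(y, max_p=3, max_q=3, d=1):
    # Brute-force (p, d, q) selection by AIC over a small grid.
    best_aic, best_order = np.inf, None
    for p, q in itertools.product(range(max_p + 1), range(max_q + 1)):
        try:
            aic = ARIMA(y, order=(p, d, q)).fit().aic
        except Exception:
            continue                      # some orders fail to converge
        if aic < best_aic:
            best_aic, best_order = aic, (p, d, q)
    return best_order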

MODEL ESTIMATION

The ARIMA model can be written as

$$a_t = \theta_1 a_{t-1} + \cdots + \theta_q a_{t-q} + z_t - \phi_1 z_{t-1} - \cdots - \phi_p z_{t-p} \qquad (2.8)$$

By assuming that the error terms are independently and identically distributed according to a normal distribution with mean zero and variance $\sigma^2$, we can use maximum likelihood estimation to estimate $\theta$ and $\phi$. The loss function can be derived according to the specified model.

DIAGNOSTIC CHECKING

As mentioned before, the basic assumption of ARIMA is that the error terms are independently distributed according to a normal distribution with mean zero and variance $\sigma^2$. For diagnostic checking, we need to check whether the mean of the residuals is close to zero, check the residual plot to see whether the variance is constant, and check the autocorrelation plot to see whether there is any violation of the assumption of zero autocorrelation. The sample autocorrelations can be calculated as

$$r_{\hat a}(k) = \frac{\sum_{t=k+1}^{n} (\hat a_t - \bar{\hat a})(\hat a_{t-k} - \bar{\hat a})}{\sum_{t=1}^{n} (\hat a_t - \bar{\hat a})^2} \qquad (2.9)$$
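Equation (2.9) translates directly into code. The following Python sketch (an illustrative helper) computes the residual autocorrelations; for an adequate model, the residual mean should be near zero and the autocorrelations should fall roughly within ±2/√n.

import numpy as np

def residual_acf(a_hat, max_lag=20):
    # Sample autocorrelations of the residuals, as in equation (2.9).
    a = np.asarray(a_hat, float)
    a_bar = a.mean()
    denom = np.sum((a - a_bar) ** 2)
    return np.array([
        np.sum((a[k:] - a_bar) * (a[:-k] - a_bar)) / denom
        for k in range(1, max_lag + 1)
    ])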

Forecasting Procedure Based On ARIMA. After obtaining a time series model, the forecast of $z_{n+l}$ can be obtained by minimum mean square error (MMSE) estimation. For example, for a seasonal model of order $(0,1,1)(0,1,1)_7$, the forecast of $z_{n+l}$ is the following conditional expectation:

$$z_{n+l} = z_{n+l-1} + z_{n+l-7} - z_{n+l-8} + a_{n+l} - \theta a_{n+l-1} - \Theta a_{n+l-7} + \theta\Theta a_{n+l-8} \qquad (2.10)$$

Then, the forecasts satisfy

$$(1 - B)(1 - B^7)\, z_n(l) = 0, \qquad l > 8 \qquad (2.11)$$

Therefore, the difference equation has the solution

$$z_n(l) = z_n(r, m) = \beta_{0m}^{(n)} + \beta_{1*}^{(n)} r \qquad (2.12)$$

The forecast function is described by the 7-time-unit levels $\beta_{0m}^{(n)}$ and a coefficient $\beta_{1*}^{(n)}$ for the trend change. Represented in autoregressive form, the forecasts can also be interpreted in terms of exponentially weighted averages [22].

The above is 1-step-ahead forecasting. In some cases, k-step-ahead forecasting may be desired. The mechanisms of these two forecasting schemes are illustrated in Figures 2.3 and 2.4, and the steps involved are summarized below.

1-STEP FORECASTING

Let $t$ be the time and $y_t$ be the value at time $t$:

Figure 2.3: Illustration of 1-step forecasting

1. Train a model using the training data set $T = [1, 2, \ldots, a-1]$.

2. Forecast only the next value (one-step forecasting) based on the trained model. The forecasting set is $F = [a]$.

3. Repeat the process by adding the newly observed data point to the training set and moving the forecasting period forward by 1 step, i.e., $T = [1, 2, \ldots, a]$ and $F = [a + 1]$.

4. To evaluate the performance of forecasting, calculate the forecasting errors by comparing the forecasts to the observed data.

Figure 2.4: Illustration of k-step forecasting

k-STEP FORECASTING

1. Train a model using the training data set $T = [1, 2, \ldots, a-1]$.

2. Forecast values over a fixed window $\tau$ (multiple-step forecasting) starting at $t_a$. The forecasting set is $F = [a, a + 1, \ldots, a + \tau]$.

3. Repeat the process by adding the newly observed data point to the training set and moving the forecasting period forward by 1 step, i.e., $T = [1, 2, \ldots, a]$ and $F = [a + 1, a + 2, \ldots, a + \tau + 1]$.

4. To evaluate the performance of forecasting, calculate the forecasting errors by comparing the forecasts to the observed data. A rolling-origin sketch of both schemes is given below.
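Both schemes are instances of rolling-origin evaluation, sketched below in Python under the assumption of a statsmodels ARIMA model; the function name and default order are hypothetical.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def rolling_forecast(y, n_train, steps=1, order=(1, 1, 1)):
    # Refit on y[0:a], forecast `steps` ahead, then slide the origin
    # forward by one observation. steps=1 reproduces the 1-step scheme;
    # steps=k reproduces the k-step scheme.
    forecasts = []
    for a in range(n_train, len(y) - steps + 1):
        res = ARIMA(y[:a], order=order).fit()
        forecasts.append(res.forecast(steps=steps))
    return np.asarray(forecasts)   # row i: forecasts made at origin n_train + i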

Issues Of ARIMA Forecasting. ARIMA is a general time series analysis tool. Under the ARIMA framework, homogeneous nonstationary time series can be transformed to stationary time series by differencing or taking logarithms. Time series often show autocorrelation, meaning that a large (small) previous value tends to be followed by a large (small) future value; the AR process is able to model this phenomenon, and taking account of previous values produces more accurate predictions of future values. Due to this generality, ARIMA models have gained great popularity in time series analysis. However, there are three issues in the use of these methods in practice:

1. Sensitivity to model specification: To apply ARIMA, we must first do the model specification. As shown in Section 2.2.1, this process typically involves a lot of personal judgment in determining the order of the AR and MA components of the model, and it requires experience to specify an appropriate model. Moreover, to find a good model, a trial-and-error strategy may be followed, which increases the computational time.

2. Requirement of a large amount of historical data: A large amount of historical data is needed to build an ARIMA model. For instance, a model of order $(p, d, q)(P, D, Q)_s$ has $p + q + P + Q$ coefficients and consumes $d + sD$ observations in differencing, so at least $p + q + P + Q + d + sD + 1$ observations are required to estimate the parameters. As a result, forecasting cannot be done at the beginning of the process, but has to start after adequate historical data are available.

3. Assumption of a deterministic mean: ARIMA assumes the underlying process is deterministic; that is, the underlying mean is either a constant or has a homogeneous trend that can be transformed away. After transformation or differencing, the mean of the differenced time series should be a constant. However, this may not always hold in practice. Due to the existence of random factors, nonstationary time series are very common, especially for data collected over short periods. For such time series, the forecasting performance of ARIMA models may not be satisfactory.

2.2.2 Dynamic Linear Models

Basics Of Dynamic Linear Models. The dynamic linear model is a special type of state space model for time series data. It is a hierarchical model with two levels: the mean model, which represents the evolution of the mean via state space transitions, and the observation model, which models the observed values by taking into account the mean evolution and observational errors. Figure 2.5 shows the structure of the DLM. The basic elements of this model are as follows.

Observation model:

$$y_t = \mu_t + v_t, \qquad v_t \sim N(0, V) \qquad (2.13)$$

Mean model:

$$\mu_t = \mu_{t-1} + w_t, \qquad w_t \sim N(0, RV) \qquad (2.14)$$

where $\mu_t$ represents the hidden mean at time $t$, $y_t$ represents the observation at time $t$, $v_t$ represents the observational error and $w_t$ represents the mean evolution. Note that the variance of the mean errors is $R$ times the variance of the observations, where $R$ is the signal-to-noise ratio, also called the drift parameter. This model has three parameters: the initial mean $\mu_0$, the variance of observations $V$, and the drift parameter $R$. Usually $R$ is assumed to be known, while the other two parameters are unknown and need to be estimated. This model is a basic DLM of first order with constant variance $V$.

Figure 2.5: Structure of the dynamic linear model

Typically the DLM is estimated using Bayesian methods [22]. The prior specification, the resulting posteriors and the formulas for forecasting are as follows.

Initial priors:

$$\mu_0 \sim N(m_0, C_0 V) \qquad (2.15)$$

$$\phi \sim \mathrm{Gamma}\!\left(\frac{n_0}{2}, \frac{d_0}{2}\right) \qquad (2.16)$$

Posteriors:

$$(\mu_t \mid y_1, \ldots, y_t) \sim N(m_t, C_t V) \qquad (2.17)$$

$$(\phi \mid y_1, \ldots, y_t) \sim \mathrm{Gamma}\!\left(\frac{n_t}{2}, \frac{d_t}{2}\right) \qquad (2.18)$$

Updating recurrence relationships:

$$C_t = \frac{C_{t-1} + R}{C_{t-1} + R + 1} \qquad (2.19)$$

$$m_t = m_{t-1} + C_t (y_t - m_{t-1}) \qquad (2.20)$$

$$n_t = n_{t-1} + 1 \qquad (2.21)$$

$$d_t = d_{t-1} + \frac{(y_t - m_{t-1})^2}{C_{t-1} + R + 1} \qquad (2.22)$$

Forecasting:

$$f_t = \bar{y} = m_{t-1}$$

The initial prior for the mean is a normal distribution, and the prior for the precision of the mean ($\phi = 1/V$) is a Gamma distribution. These are the conjugate priors of the model; that is, the resulting posterior distributions have the same form as the priors, except that the parameters are updated by equations 2.19-2.22. In this framework, the forecast of the observation at time $t$ is equal to the posterior mean at time $t-1$.
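The updating recurrences (2.19)-(2.22) and the forecast $f_t = m_{t-1}$ can be implemented in a few lines, as in the following Python sketch of this first-order DLM filter (the initial values other than m0 are placeholders, to be set from historical data as described next):

import numpy as np

def dlm_filter(y, R, m0, C0=1.0, n0=1.0, d0=1.0):
    # One pass of the first-order DLM recurrences (2.19)-(2.22),
    # returning the one-step forecasts f_t = m_{t-1}.
    m, C, n, d = m0, C0, n0, d0
    forecasts = []
    for yt in np.asarray(y, dtype=float):
        forecasts.append(m)                      # f_t = m_{t-1}
        C_new = (C + R) / (C + R + 1.0)          # (2.19)
        d = d + (yt - m) ** 2 / (C + R + 1.0)    # (2.22), uses C_{t-1}
        m = m + C_new * (yt - m)                 # (2.20)
        n = n + 1.0                              # (2.21)
        C = C_new
    return np.asarray(forecasts)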

To apply the DLM, the starting mean $\mu_0$, the variance of observations $V$ and the drift parameter (signal-to-noise ratio) $R$ have to be estimated or specified. $\mu_0$ and $V$ are estimated using historical data, while $R$ is conventionally specified by users.

Forecasting Procedure Based On DLM. The forecasting procedure based on the DLM has the following steps.

Step 1: Before performing forecasting, the seasonality, if any, should be removed by differencing the data,

$$\hat{y}_t = (1 - B^S)\, y_t \qquad (2.23)$$

Step 2: Estimate the parameters of the initial priors using the historical data. Suppose that the historical dataset $y^H$ contains $H$ observations. Then $m_0$, $C_0$, $n_0$ and $d_0$ in the initial priors can be estimated as follows:

$$m_0 = \bar{y}^H \qquad (2.24)$$

$$C_0 = 1 \qquad (2.25)$$

$$n_0 = H \qquad (2.26)$$

$$d_0 = n_0 \cdot \mathrm{var}(y^H) \qquad (2.27)$$

Here $m_0$ is the estimate of the initial mean $\mu_0$; $C_0$ is the estimate of the initial ratio of the mean variance to the observational variance (the initial signal-to-noise ratio); and $n_0$ and $d_0$ are the degrees of freedom and the scale parameter of the Gamma distribution.

Step 3: The forecast value $f_t$ can be obtained from the updating equations (2.19)-(2.22) and the forecasting equation $f_t = m_{t-1}$.

Issues Of DLM Forecasting. The DLM contains an important parameter, the signal-to-noise ratio $R$, which is the ratio of the mean variance to the observational variance. This parameter is typically unknown and needs to be specified by users. However, there is no obvious way to determine its value, and users have to guess a value based on their experience. This brings some inconvenience to the application of this method in practice and may also degrade its performance when the signal-to-noise ratio is not specified appropriately.

2.3 The Proposed Method

For dynamic linear models, the forecasting procedure starts with inputting the initial priors and the signal-to-noise ratio $R$. To determine the initial mean $\mu_0$, we may use the information in the historical dataset. The variance of observations, $V$, is assumed to be constant and unknown; it can be estimated through $n_t$ and $d_t$ in equations (2.18), (2.21) and (2.22). Finally, we need to specify the value of the signal-to-noise ratio $R$; then all parameters are set and we are able to forecast. To make the specification of $R$ automatic, we propose a least squares estimation method that finds the estimate of $R$ minimizing the mean squared error of forecasting.

Figure 2.6: The proposed forecasting method with R estimates

Figure 2.6 illustrates the schema of the new forecasting procedure using the R estimates. The left side of the diagram shows that the initial mean is estimated from historical data as

m0 = (1/nh ) Σ_{i=1}^{nh} Yi

where nh is the number of historical data points. The right side shows the forecasting procedure with the updated R estimate. The new data, shown as blue squares, are obtained sequentially. Each time a new observation arrives, we forecast the next value using the value of R automatically estimated by the proposed method. nk is the total number of forecast values. The forecasting procedure of the proposed method can be summarized as follows:

Step 1 : Use historical data to estimate µ0 .

Step 2 : Assume the initial prior C0 = 1 and use the proposed method to estimate the signal-to-noise ratio R:

R̂ = arg min_r Σ_{i=nk+1}^{nk+nf} (fi (r) − Yi )²

Step 3 : Update Ct in equation (2.19) using the R estimate.

Step 4 : Calculate the forecast ft = mt−1 using the updated Ct and equation (2.20).

Note that initially we assume the variance of the hidden mean equals the observational variance by setting C0 = 1. As more data arrive, Ct is updated and the ratio becomes more and more accurate.
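The following is a minimal sketch of this least-squares idea, reusing the dlm_filter recursion sketched earlier; the grid of candidate values and the prior settings (n0 = 1, d0 = var(y)) are illustrative assumptions, not prescribed choices.

```python
import numpy as np

def estimate_R(y, m0, grid=np.logspace(-4, 0, 50)):
    """Least-squares estimate of R: choose the candidate value that
    minimizes the one-step-ahead forecast MSE, a simple stand-in for
    arg min_r sum_i (f_i(r) - Y_i)^2."""
    best_r, best_mse = grid[0], np.inf
    for r in grid:
        f = dlm_filter(y, m0=m0, C0=1.0, n0=1.0, d0=np.var(y), R=r)
        mse = np.mean((f - y) ** 2)
        if mse < best_mse:
            best_r, best_mse = r, mse
    return best_r
```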

2.4 Numerical Study

A comprehensive numerical study is done to evaluate the performance of the

proposed forecasting procedure described in Section 2.3 and to compare the proposed

method to ARIMA and DLM methods described in Section 2.2. The scenario design

and computation procedure in the simulations will be presented in Section 2.4.1,

specific concerns to be addressed in this study will be given in Section 2.4.2, results

of simulations will be shown in Section 2.4.3, and our findings will be summarized in

Section 2.4.4.

2.4.1 Scenario Design and Computation Procedure

Some preliminary studies are first done to determine the ranges of the param-

eters V , nh , nf , R and rin such that patterns in the forecasting performance, if any,

can be captured. These five parameters are defined as follows in the simulation:

nh = [20 100 300 500 700];

nf = [20 100 300 500 700];

R = [0.001 0.01 0.02 0.05 0.08];

rin = [0.001 0.01 0.02 0.05 0.08 Rest ];


V = [10 20 30 40 50];

where nh is the number of historical data, nf is the number of forecasts, R is the true value of the signal-to-noise ratio, rin is the specified value of R or the R estimate Rest obtained by the proposed method, and V is the observation variance.

For instance, because the historical data is only used for estimating the initial mean, its range is selected to reflect the effect of the accuracy of mean estimation. Also, small values of rin are of interest in this study because the signal-to-noise ratio is typically very small in practice. Five fixed values are considered for each parameter to cover

three levels (small, medium and large). For rin , in order to compare the scenario with a specified r and the scenario with r updated by the proposed method, five fixed values of r are considered along with the R estimate Rest . To find out the effect of each parameter, we change one parameter at a time and fix the others at typical levels. Details of the parameter settings are displayed in Table 4.1.

In the simulation study, 1000 time series are generated under each parameter

setting. For each time series, the data are generated through the following steps:

Step 1 : Specify parameters V, nh , nf , R and C0 as shown earlier in this section

Step 2 : Generate a random number µ0 , where µ0 ∼ N (m0 , C0 ) and m0 is set

to 0

Step 3 : Generate a random number v, where v ∼ N (0, V )

Step 4 : Generate a random number w, where w ∼ N (0, RV )

Step 5 : Calculate the hidden mean, µi by µi = µi−1 + wi

Step 6 : Calculate the observed value, Yi by Yi = µi + vi

Repeat Step 3 to Step 6 until i = nh + nf

The first nh observations are used as historical data, while the rest are used in forecasting.
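A minimal Python sketch of this generation scheme, assuming m0 = 0 as in Step 2; the function name is illustrative.

```python
import numpy as np

def simulate_local_level(nh, nf, V, R, m0=0.0, C0=1.0, seed=None):
    """Generate one series following Steps 1-6: a random-walk hidden
    mean mu_i = mu_{i-1} + w_i with w_i ~ N(0, R*V), observed as
    Y_i = mu_i + v_i with v_i ~ N(0, V)."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(m0, np.sqrt(C0))              # Step 2
    y = np.empty(nh + nf)
    for i in range(nh + nf):
        mu += rng.normal(0.0, np.sqrt(R * V))     # Steps 4-5
        y[i] = mu + rng.normal(0.0, np.sqrt(V))   # Steps 3, 6
    return y[:nh], y[nh:]                         # historical / forecasting parts
```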

2.4.2 Concerns To Address

Our simulation aims to address the following questions:

Question 1: What is the performance of the updating-R procedure on R estimation? Histograms are plotted to visualize the distribution of the R estimate under various parameter settings. Moreover, to evaluate whether the procedure is able to accurately estimate the true R, the ratio of the R estimate to the true R, r/R, is used to allow a fair comparison. That is, r/R will be close to 1 if the estimation is good.

Question 2: What is the performance of ARIMA, DLM and DLM with the updating-R procedure, and what are their strengths and weaknesses? The root mean square error (RMSE) is used to evaluate the prediction error and to compare the performance of the three methods.

Question 3: How robust is each method to the specification of parameters, and which method is the most robust? Methods are desired to be robust, i.e., insensitive to the setting of parameters. Particularly in our study, we also consider variation of the mean, i.e., mean drift, in the time series.

Question 4: How do the parameters affect the performance of forecasting and of R estimation (only for DLM with the updating-R procedure)? Through the study of the effects of the parameters, we will be able to validate our method by evaluating whether the responses are reasonable.

2.4.3 Results

Performance Of The Estimator Of R To show the performance of the proposed estimator of R, typical levels of R are considered, in which the true R = [0.001 0.01 0.02 0.05 0.08] and the other parameters are set within the ranges of typical levels shown in Section 2.4.1, in order to include a wide range of values while obtaining analyzable time series. 2500 simulations are done and the R estimate is obtained in each simulation. Figure 2.7 shows the number of forecasts nf versus the mean and variance of the R estimates. From Figure 2.7a, we can see that the mean of the R estimates converges to the true value of R, which suggests that the proposed R estimator is unbiased. Figure 2.7b shows that the variance of the R estimate becomes smaller as nf becomes larger, which is consistent with intuition.

Effect Of Parameters Simulations are done to study the effects of the parameters V , nf and R. Similar to the previous simulation, typical levels of the parameters are set to include a wide range of possible values while obtaining analyzable time series. 2500 simulations are done and the R estimate is obtained in each simulation. The parameter settings are shown in Section 2.4.1. Mean square error (MSE) is used to measure the error of forecasting because it incorporates both the variance of Y and its bias. The ratio between the estimated R and the true R, i.e. r/R, is used to measure the deviation of the estimated R from the true R; r/R = 1 if r = R. It is useful for comparing R estimates amongst time series with different true R's.

Figure 2.8 shows that higher V gives a larger mean and variance of the error. With the same signal-to-noise ratio, if the variance of the time series V is larger, the evolution error of the time series will also be larger. The DLM models the mean of the time series; in other words, if V is larger, the range of the mean of the time series will be larger. Therefore, the MSE of forecasting will be larger.

The distributions of r/R under different V values are similar, and no obvious trend is observed by changing V . In conclusion, the value of V does not affect the estimation of R.

(a) Mean of R estimates (b) Variance of R estimates

Figure 2.7: Mean (a) and variance (b) of the R estimates versus the number of forecasts nf

Figure 2.8: Histogram of MSE of V

Figure 2.9: Histogram of r/R of V

In Figure 2.10 and Figure 2.11, the graphs from left to right are the histograms of MSE and r/R, respectively, for increasing nf . V , nh and R are set to typical values, which are 10, 100 and 0.02 respectively. Figure 2.10 shows that a higher nf gives a more precise error, but it does not help to reduce the error itself. So, in our simulation study, the error of forecasting is determined only by the range of the time series, i.e., larger fluctuations give larger errors. However, increasing the number of data for forecasting, nf , can help to increase the forecasting precision.

Figure 2.10: Histogram of MSE of nf

The distribution is skewed when nf is small and becomes more symmetric and concentrated as nf increases. Figure 2.11 shows that the estimation of R is quite precise when nf = 500. So, the forecasting precision increases when the estimation precision of R increases.
Figure 2.11: Histogram of r/R of nf

In Figure 2.12 and Figure 2.13, the graphs from left to right are the histograms of MSE and r/R, respectively, for increasing R. V , nh and nf are set to typical values, which are 10, 100 and 500 respectively. Figures 2.12 and 2.13 show that larger R gives a slightly larger error but a better R estimate. This is because, for larger R, the signal is clearer in contrast with the noise. Therefore, the estimation of R is more precise.

Figure 2.12: Histogram of MSE of R

Figure 2.13: Histogram of r/R of R

Figure 2.14: Design of experiment and the prediction performance (mean of MSE and
r/R of each set of experiment)

Forecasting Performance Of The Proposed Procedure In the simulation,

time series are generated by the following parameters: initial mean, variance, signal-

to-noise ratio, number of historical data and the number of data to be forecasted.

In order to study the effects of the parameters and to investigate the prediction

performance of the algorithms under various circumstances, five levels of values for

each parameter are carefully chosen. For each parameter, all of its levels will be tested

while other parameters are fixed to typical values so that the effects of each parameter

will be better visualized with more general time series. One-step forecasting is used so that the performances are clearly shown and compared in a simple forecasting case. To compare the proposed method to the conventional method, two ways are designed to specify the signal-to-noise ratio: using a chosen fixed R and using the updating R.

The prediction follows the trend of the time series even when there is mean drift. In Figure 2.15 (upper panel), the black line shows the observed values and the red line the forecasts. In the figure, we can see the mean drift in the observed data. Mean drift is defined as the random evolution of the hidden mean value. In our model, we assume that the increments of the hidden mean follow a normal distribution with mean zero and variance W = RV . Therefore, the model performs well on time series data even with mean drift.

Figure 2.15: Performance of the proposed procedure for forecasting

Figure 2.15 (lower panel) shows the R estimate of one simulation with parameters V = 10, nh = 100, nf = 700 and R = 0.02. Again, we see that the R estimate converges to the true value asymptotically. This means that the proposed method can accurately estimate the true R, and the model reconstructs the time series and predicts future values very well.

Updating-R Vs Fixed-R In this simulation, we use the previous experimental design again. Five levels are carefully chosen to represent various circumstances. Applying the forecasting models, the mean square errors (MSE) of each method are calculated and plotted in the same graph. The purpose of this simulation is to see how well the updating-R method performs over time versus the fixed-R method.

Figure 2.16: Performance of forecasting with Updating-R vs. Fixed-R

From the graph, the performance over time is clearly visualized. The red line in Figure 2.16 shows the MSE of the model using the R estimate. The true R is 0.02, represented by the green line. The MSE with the R estimate approaches that with the true R. If we set R to 0.05, the performance is slightly worse; if we set R to 0.001, the performance is much worse. That means that if we select a wrong R, the performance can be significantly affected, and our method helps to solve this issue.

2.4.4 Summary of Numerical Study

The results in Section 2.4.3 address the concerns stated in Section 2.4.2. DLM with the updating-R procedure always performs better than the fixed-R method. The proposed R estimator is unbiased and converges to the true value when the sample size is large. DLM with updating-R is more robust than the conventional DLM method when mean drift exists in the time series. In general, more forecasting data (nf ) gives a smaller prediction variance and a more accurate R estimate, a larger observation variance gives a larger mean and variance of the error, and a larger R gives a slightly larger error but a better R estimate.

2.5 Case Study

Six real datasets are used in this study to compare the two methods described in Section 2.2 and validate the effectiveness of the proposed method in Section 2.3. These data are shown in Figure 2.17. Data 1 to Data 3 are counts of cargo loads in the railroad industry and Data 4 to Data 6 are counts of received calls in a nurse triage call center. It is noticeable that Data 1 to Data 4 are stationary and cyclical with about 7 days per cycle, while Data 5 and Data 6 are non-cyclical but may have mean drifts.

Figure 2.18 shows the time series of Data 1 to Data 6 and their forecasting results. Data 1 to Data 4 are de-seasonalized before being input to the model, so the forecasts in Figure 2.18 capture their cyclical properties well. Data 5 and Data 6 do not show any clear seasonal pattern, so there is no straightforward way to eliminate their cyclical components. Nevertheless, from the perspective of trend modeling, we see that


Figure 2.17: The data used in the case study

the forecasts of all datasets follow the trends of the time series. To examine the differences in forecasting performance across the cases, we have done some comparisons between the ARIMA and DLM methods.

To build ARIMA models for the time series, we need to specify a model first. To do this, we look at the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots. The model building process for Data 1 is given in the following as a demonstration.

From both plots, it is obvious that the cycle is 7 days, so we specify the seasonality as 7. After specifying the seasonality, the ACF and PACF are plotted again as follows. Now, we can see from the ACF plot that there is autocorrelation at lag 7; therefore, we specify the seasonal MA (SMA) order as 1. Looking at the PACF plot, it shows autocorrelation at lags 7 and 14, and maybe more, but this is not important because

Figure 2.18: Forecasting results of the six datasets

Figure 2.19: ACF and PACF plots of original data in case 1

practically we do not specify an order higher than 2. For simplicity, we specify the seasonal AR (SAR) order as 1. Again, ACF and PACF plots are produced for the new model.

Figure 2.20: ACF and PACF plots of deseasonalized data in case 1

From both the ACF and PACF plots, we can see autocorrelation at lag 1. Therefore, the MA and AR terms are both specified as 1. By plotting the ACF and PACF again, we see that the current model (1, 0, 1)(1, 1, 1)7 is much better.

However, according to experience, the model specification guideline does not always give the best model. Usually, we try several more models to make sure the best one is chosen; in doing so, users should be wary of overfitting the model. Therefore, we also try (0, 0, 1)(1, 1, 1)7 , (1, 0, 0)(1, 1, 1)7 and (0, 0, 0)(1, 1, 1)7 , as sketched below.
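As a sketch of this model-comparison step, the following fits the candidate specifications with statsmodels; the variable series is assumed to hold the daily counts of one case (e.g. Data 1), and AIC is used here only as one convenient comparison criterion.

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# `series` is assumed to hold the daily counts of one case (e.g. Data 1)
candidates = [(1, 0, 1), (0, 0, 1), (1, 0, 0), (0, 0, 0)]
for order in candidates:
    fit = SARIMAX(series, order=order,
                  seasonal_order=(1, 1, 1, 7)).fit(disp=False)
    print(order, "AIC:", fit.aic)  # compare candidates, wary of overfitting
```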

To build a DLM, we just need to follow the method in Section 2.3. Figure 2.23 shows the comparison of the results of the proposed method and the four ARIMA models for the six cases. The solid blue line represents the proposed method and the colored dotted
Figure 2.21: ACF and PACF plots of model (0, 0, 0)(1, 1, 1)7 for case 1

Figure 2.22: ACF and PACF plots of model (1, 0, 1)(1, 1, 1)7 for case 1

lines represent ARIMA models with different specifications. From Figure 2.23, it is clear that DLM performs better in cases 5 and 6 than in cases 1 to 4. The reason is probably that cases 5 and 6 have mean drift, which can be modeled by DLM but not by ARIMA.

Figure 2.23: MSE of the two methods, DLM (dark blue) vs. ARIMA, for the 6 cases, ordered left to right and top to bottom

Figure 2.24 shows the R estimates of the six cases. In cases 1 to 4, the R estimate approaches zero, which means the signal-to-noise ratio is very low. Cases 5 and 6 show much larger R estimates. Recalling Figures 2.12 and 2.13, higher R values give slightly larger errors, but if R is much smaller than 0.1, the estimate will be biased and have large variance. So, cases 5 and 6 may give slightly larger errors simply due to having large R, and cases 1 to 4 may estimate their R's inaccurately. We should keep these properties in mind when we analyze Figure 2.24 and draw any conclusions.

Figure 2.24: R estimates of the 6 cases, ordered left to right and top to bottom. We see that, except for case 5, the R estimate approaches a stable value when there are more observations.

From Figure 2.25, the updating-R procedure is always better than the fixed-R method. So the proposed updating-R procedure is not only convenient, since it determines R systematically and automatically, but also better than, or at least as accurate as, the fixed-R method.

2.6 Summary And Future Work

Figure 2.25: Updating-R vs. Fixed-R (R = [0.001 0.01 0.02 0.05 0.08])

Time series forecasting in service industries is a very challenging problem due to the random patterns in the data. Many methods have been developed to solve this problem and successfully applied in industry. However, these methods have several issues. For instance, the hidden mean of a time series in service industries may vary over time, but ARIMA is not able to model this phenomenon. Moreover, ARIMA needs a large amount of historical data to model a time series, which is not always available. Also, model specification sometimes imposes much difficulty on practitioners: if a wrong model is specified, the forecasting results will be misleading. Therefore, the DLM, a special type of state space method, has been developed as an alternative tool for time series forecasting. However, to apply the DLM, the signal-to-noise ratio R has to be specified. Since the true value of R is generally not available, the only way is to guess a value, which is inconvenient and unreliable. To overcome this problem, we propose a method to estimate R automatically within the forecasting procedure. The properties of the proposed R estimator and the new forecasting procedure with this estimator are studied by simulations. A case study is also done in which the proposed method is compared with the ARIMA method using six datasets from service industries. It is found that the proposed method outperforms ARIMA when a time series has mean drift.

There are some open issues in this study which will be considered in my future

research. The following are two examples:

1. The variation of the signal-to-noise ratio: Besides the issue of the unknown signal-to-noise ratio, the DLM assumes the signal-to-noise ratio to be constant. In practice, it is possible that the signal-to-noise ratio is not constant but changes over time. It will be interesting to study the behavior of a changing R and to develop a forecasting procedure with appropriate R estimates.

2. Variation of cycle or seasonality: West et al. [22] suggest two ways to cope with cyclical features and seasonality: the form-free seasonality model and the Fourier form representation of seasonality. In the form-free seasonality model, they suggest first-order and second-order polynomials to model the seasonality. In the Fourier form representation of seasonality, they suggest breaking the time series down into many harmonic cycles to represent the seasonality. Application of these methods will be considered in my future research.

CHAPTER 3

Pattern-Based Real-Time Prediction Of Semi-Periodic And Nonstationary Time

Series

3.1 Introduction

Respiration is an involuntary action, yet, within limits, individuals are capable of controlling their own breathing [5]. Respiratory motion is mainly regulated by the level of the partial pressure of carbon dioxide: a higher partial pressure of CO2 means a stronger urge to breathe [5]. Besides physical factors, environmental and psychological factors may also contribute to the variation. Respiratory motion time series are therefore semi-periodic and time-varying.

The occurrences of drifts can be considered random. One thing to note is that both inter-individual and intra-individual variations are usually significant. Fortunately, respiratory patterns usually show some statistical properties that can be used for the prediction of tumor position. Much effort has been devoted to mining these properties and using them for respiratory motion prediction. In consideration of the high inter- and intra-individual variation, an adaptive method that works well for all kinds of patients at all times is desired. The importance of respiratory motion prediction is introduced in the following. In robotic radiosurgery and medical imaging such as positron emission tomography (PET) or computed tomography (CT) scans, all devices that involve tracking tumor position during treatment suffer from system latencies [5, 24]. System latency is the response time of the whole system from detection to delivery of the treatment action.

If a tumor is in the thoracic or upper abdominal region, respiratory motion will be the dominant factor in tumor movement [25]. Without accounting for respiratory motion, critical misalignment between the irradiated field and the target tumor volume may occur during a radiotherapy treatment fraction, imposing significant radiation dose on normal body tissues. There are five ways to account for respiratory motion in radiotherapy [5]: motion-encompassing methods, respiratory gating methods, breath-hold methods, forced shallow breathing with abdominal compression, and real-time tumor-tracking methods.

The first four methods usually require patients' participation, which is inconvenient and may not be feasible for all patients. Respiratory gating methods were first adopted in Japan in the late 1980s [5]. Following their success, the approach has become popular and much research effort has been devoted to this technique.

Another method developed to account for respiratory motion during radiotherapy is real-time tumor tracking. One of the most famous systems, the CyberKnife robotic linear accelerator, is a realization of real-time tumor tracking. Real-time tumor tracking can utilize the total duty cycle without any interruption. This method requires the least human participation, which may enhance reliability. Also, time is saved, which means more patients can be served within a limited time span. To succeed, this method should be able to do four things: (1) identify the tumor position in real time; (2) anticipate the tumor motion to allow for time delays in the response of the beam-positioning system; (3) reposition the beam; and (4) adapt the dosimetry to allow for changing lung volume and critical structure locations during the breathing cycle. For the gating method, special caution must be taken by the therapist if the breathing pattern differs from the simulation. This problem does not exist in the real-time tracking method as long as the system can do the aforementioned four things. In this chapter, we discuss in detail the second
Table 3.1: A list of latencies (in ms) of different systems.

                                  VERO   MLC   MAD   CyberKnife
Position acquisition                25   309    30           25
Position calculation                 2    20     -           15
Gimbals/MLC/robot control cycle     20    52    45           75
Other                                -    38   100            -
Total                               47   420   175          115

task: predicting the tumor motion to compensate for the system latency, which ranges from 115 ms to 420 ms for different systems [26, 24].

The current generation of the CyberKnife has a latency of about 115 ms, down from 192.5 ms in the previous version, which is still widely in use [26].

3.2 Related Work

Many prediction methods for respiratory motion have been developed for radiation therapy to compensate for system latencies. The following list contains the methods that have been in the spotlight in recent years.

Through this study, we have reviewed some of the latest methods. These include the Time-Varying Seasonal Autoregressive model with Residual Adaptation (TVSAR) [27, 17], Neural Networks (NN) [28], Kernel Density Estimation (KDE) [29], Support Vector Regression prediction (SVRpred) [16, 24], Recursive Least Squares (RLS) [24], the MULIN algorithms [24], Normalized Least Mean Squares, Wavelet-based Multiscale Autoregression [24, 30], Wavelet Neural Networks [31] and EKF Frequency Tracking.

Ernst [24] conducted a survey of some of these methods in 2013. He concluded that Wavelet-based Multiscale Autoregression (wLMS) [15, 30] has the best performance in short-term prediction, and that Support Vector Regression prediction (SVRpred) [16], which was developed based on the Accurate Online Support Vector Regression proposed by Renaud [30], performs better in longer-term prediction. Support vector regression (SVR) has been widely applied to respiratory time series prediction [32, 16, 24, 33, 34, 35, 36]. In all of these current SVR methods, the coefficients are trained either using the whole time series, to capture all possible information, or using a sliding window, to capture recent developments. We will show that, through pattern matching, better prediction can be obtained by selecting only similar patterns as inputs to SVR. Ichiji [37, 27, 17] proposed resi-TVSAR in 2013 and reported very good performance. Therefore, we select TVSAR, wLMS and SVRpred for comparison with our proposed method. We will also compare our proposed method to Seasonal ARIMA, a very popular classic method.

3.2.1 Time-Varying Seasonal Autoregression (TVSAR)

The time-varying periodic nature of respiratory motion makes the prediction challenging. Most methods, which assume constant periodicity, do not apply to this problem. ARIMA, a very popular method, provides a general framework which can model linear and stationary time series or homogeneous non-stationary time series. Seasonal ARIMA (SARIMA) was developed to further cope with constant periodicity. To overcome the time-varying periodic nature, Homma proposed a modified SARIMA [37, 14] in 2009. The method converts the time-varying periodic component to a constant periodic one by adjusting the time variation. However, because the time-varying periodic component is random in nature, it is very hard to convert it accurately to a constant form, and the outcome of the modified SARIMA is not satisfactory. In 2012, Ichiji et al. [37] proposed TVSAR, an AR model that takes the varying cycles into account. The details of the method follow.

The N th SAR model of a time series y(t), t = 1, 2, . . . is given as follows [37, 27, 17]:

y(t) = ε(t) + Σ_{n=1}^{N} Φn · y(t − n · s) (3.1)

where Φn , n = 1, 2, . . . , N are the SAR coefficients, N is the order of the SAR model, s is the period of the target time series y(t), and ε(t) ∼ N (0, σ²) is Gaussian noise.

Then, the SAR model-based equation for h-sample-ahead prediction is given by replacing t with t + h:

ŷ(t + h|t) = Σ_{n=1}^{N} Φ̂n · y(t + h − n · s) (3.2)

We can note that this assumes a constant prediction horizon h for all time-varying

intervals.

To overcome the limitation of the general SAR model, TVSAR introduced a

time-varying and irregular interval, instead of a constant period s.

The TVSAR model can then be written as:

y(t) = ε(t) + Σ_{n=1}^{N} Φn · y(t − r̂n (t|t)) (3.3)

The prediction equation of the N th TVSAR model for prediction horizon h is given as:

ŷ(t + h|t) = Σ_{n=1}^{N} Φ̂n · y(t + h − r̂n (t + h|t)) (3.4)

where r̂n (t|t) > 0 are called reference intervals, indicating the past observed values at a phase corresponding to the current value y(t). The reference intervals are the key part of TVSAR; they are found by pattern matching, i.e., by calculating the correlation with the past data. In other words, the reference intervals point to the past samples that are at the same phase as the current value y(t). An SAR model is then built using the points which are in the past few cycles and at the same phase as the to-be-predicted value.


The reference intervals are found by finding the lags which maximize the correlation between the past data and the current window. The estimation procedure is as follows:

At time t, calculate a correlation function of lag k = 0, 1, 2, . . . , given by

C(t, k) = (1/w) Σ_{j=0}^{w−1} [(y(t − j) − µt )/σt ] · [(y(t − k − j) − µt−k )/σt−k ] (3.5)

where µt and σt are the sample mean and standard deviation of the subset time series of length w given by [y(t − w + 1), y(t − w + 2), . . . , y(t)]. Figure 3.1 illustrates the estimation procedure [17, 37]. The nth reference interval is estimated by finding the lag k

Figure 3.1: Estimation procedure for reference intervals

which attains the nth local maximum of the correlation function C(t, k):

r̂n (t|t) = arg max_k C(t, k) (3.6)

where the search range is set as half of w around the reference interval found at time t − 1, i.e. r̂n (t − 1|t − 1) − w/2 < k < r̂n (t − 1|t − 1) + w/2 for each n.

The subset length is updated by w = ⌊a · r̂1 (t|t) + 0.5⌋, where a is a coefficient to adapt the length based on r̂1 (t|t). The initial reference intervals used for the estimation procedure are given as:

r̂n (ts |ts ) = n × ŝ (3.7)
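A minimal Python sketch of this reference-interval search under the definitions above; the variable prev_r (the intervals found at time t − 1) is an illustrative name, edge handling is simplified, and y is assumed to be a NumPy array.

```python
import numpy as np

def reference_intervals(y, t, prev_r, w, N):
    """Sketch of the search in eqs. (3.5)-(3.6): for each n, find the lag
    k maximizing the normalized correlation between the current window of
    length w and the window lagged by k, searching within +/- w/2 of the
    interval found at time t-1 (prev_r[n])."""
    seg = y[t - w + 1:t + 1].astype(float)
    seg = (seg - seg.mean()) / seg.std()
    r_hat = []
    for n in range(N):
        lo = max(w, prev_r[n] - w // 2)     # keep lags past the window
        hi = prev_r[n] + w // 2
        best_k, best_c = lo, -np.inf
        for k in range(lo, hi + 1):
            lag = y[t - k - w + 1:t - k + 1].astype(float)
            lag = (lag - lag.mean()) / lag.std()
            c = np.mean(seg * lag)          # C(t, k)
            if c > best_c:
                best_k, best_c = k, c
        r_hat.append(best_k)
    return r_hat
```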

The major issues of TVSAR are:

1. It does not take baseline shift and amplitude change into account.

2. It is hard to maintain an effective window size, which makes the search for reference intervals difficult: a small window is susceptible to noise, while a larger window may overlook a potentially similar phase.

3. It assumes a fixed prediction horizon h for all reference intervals, which obviously have various lengths.

3.2.2 Wavelet-Based Multiscale Autoregression

Since respiratory motion arises from the coordination of multiple muscles and organs, a respiratory motion time series is a record of chest motion at a particular location under the coordination of multiple body parts over a definite time horizon. Simply speaking, a respiratory motion time series is a mixture of various signals. Figure 3.2 shows an example of a 3-level wavelet decomposition of respiratory motion. Each band has its own pattern, and wavelet decomposition can thus enhance the predictive power of an autoregressive model. Renaud et al. [30] proposed using the à trous wavelet decomposition to build an autoregressive model

54
Figure 3.2: Wavelet decomposition of a respiratory motion time series

for prediction. They provided the following closed-form equations for the wavelet coefficients at subsequent times, so online updating of the wavelets is available:

c0,n = yn , cj+1,n = (cj,n−2^j + cj,n )/2, Wj+1,n = cj,n − cj+1,n

A signal y is decomposed into J discrete wavelet scales, W1 , . . . , WJ , and a smoothed signal cJ by passing it through low-pass and high-pass filters with particular frequency ranges, i.e. yn = W1,n + · · · + WJ,n + cJ,n . Also, aj and aJ+1 denote the regression depths of level Wj and of the smoothed signal cJ respectively. The multiscale autoregressive (MAR) forecast can be made by building an autoregressive (AR) prediction

model for each wavelet scale and then summing up all of the predictions:

ŷ_{n+k}^{MAR} = Σ_{j=1}^{J} wj^T Ŵn,j + w_{J+1}^T ĉn (3.8)

Ŵn,j = (Wj,n−2^j·0 , Wj,n−2^j·1 , . . . , Wj,n−2^j·(aj−1) )^T (3.9)

ĉn = (cJ,n−2^J·0 , cJ,n−2^J·1 , . . . , cJ,n−2^J·(aJ+1−1) )^T (3.10)

An example with aj = 3 for all j and J = 2 is illustrated in Figure 3.3.

Figure 3.3: An example of an order-3 AR model built from 2-level wavelet scales

The weights of each AR model, wj , are learned adaptively from the least-mean-squares prediction error over a window of data.

B = (ln−k , . . . , ln−k−M+1 )^T , lt = (Ŵt,1^T , . . . , Ŵt,J^T , ĉt^T )^T (3.11)

w = (w1^T , . . . , w_{J+1}^T )^T , sn = (yn , . . . , yn−M+1 )^T

Solve Bw = sn via the normal equations to update w. B denotes the wavelet decomposition at time n − k with window size M .

To cope with the possible singularity of the normal equations used to solve for w, Ernst [15] replaces (B^T B)^{−1} B^T by the Moore–Penrose pseudoinverse of B. Ernst also suggests an exponential averaging parameter µ to include possible missing information which is not contained in the current signal window. Finally, wLMS is defined as follows:

ŷ_{n+k}^{MAR} = Σ_{j=1}^{J} w_{n,j}^T Ŵn,j + w_{n,J+1}^T ĉn (3.12)

ŵn = (w_{n,1}^T , . . . , w_{n,J+1}^T )^T (3.13)

w1 = · · · = w_{k+M−1} = (0, . . . , 0)^T (3.14)

w_{n+1} = (1 − µ)wn + µB^+ sn , µ ∈ [0, 1], n ≥ k + M (3.15)

Note that wLMS only uses the latest data to build a model. It performs very well for very short-term prediction, but its medium- to long-term prediction ability is unsatisfactory.

3.3 Pattern-Based Variant Best-Neighbors Prediction Using Raw Data

Respiratory motion is a type of semi-periodic motion which shows periodicity with variation in mean position, phase and frequency. The occurrence of these changes has complex causes and can be considered random, which means that, for a given pattern, the future value has an expected value with some variance. Even though the future value of an individual pattern is random, a collection of similar patterns can give a very accurate prediction. Our study of tumor position prediction shows that the average of these responses provides a very accurate and effective prediction of the respiratory motion.

In this study, a pattern-matching-based framework is employed to discover similar patterns in the past record and exploit the information of those patterns for prediction of tumor position. Figure 3.4 shows the general approach of the pattern-based online prediction framework. Instead of only using recent cycles or using the whole time series to train a model, as most approaches do, an effective and accurate way is to look for similar patterns in the past record and analyze the information of those patterns to make the prediction.

Figure 3.6 shows scatter plots and partial autocorrelation plots of the cycle height of a respiratory motion time series versus its first lag. It shows that the height of respiratory motion is autoregressive, which provides a strong basis for using pattern matching for prediction, because similar patterns should have stochastically similar responses.

Figure 3.4: The general approach of the proposed pattern-based Variant-Best-Neighbors prediction using raw data

We introduce a pattern-based Variant Best Neighbors (VBN) prediction method.

The number of best neighbors, which varies, is determined by a pattern-similarity threshold and a cutoff value (k). The general approach is shown in Figure 3.4. Before starting prediction, the ratio of window size to cycle length (R) needs to be

Figure 3.5: Three best neighbors (solid black lines) of the current segment (solid blue
line), the dotted lines are their ”future” values

determined by training and validation. In fact, the first step in the flowchart is itself a prediction process, used to try various parameter sets.

Through validation, the parameter set that gives the most accurate result is selected. After obtaining the optimal window ratio, a pattern library is built using the selected R. Then, we determine the best matching patterns from the pattern library by a variant best neighbors (VBN) approach. Next, the optimal subset of predictive patterns is decided by statistical and feature analysis: the previous step provides the generally best-matching patterns, and this step further refines the set of best neighbors (BNs) in order to significantly enhance the prediction performance. After obtaining the BNs, we can use their information to make a prediction, either by simply taking the average of the "future" values of the BNs or by applying support vector regression. The performance of both methods will be discussed in the results section. Figure 3.5 illustrates the general approach: three patterns similar to the current segment at time t are found. The solid lines represent
the current segment and their BNs. The dotted lines represent the future values of the best neighbors and the current segment.

Figure 3.6: Scatter plots (left) and sample partial autocorrelation functions (right) of the height and the interval of respiratory motion versus their 1st lags

3.3.1 Personalized Pattern Monitoring Window Size

In our study, the respiratory motion data of 27 patients are used. The length of the data ranges from about 30 minutes to 60 minutes. The first 60% of the total data is used to build up a pattern library; the next 20%, but at most 6,000 points, is used for validation, and the remainder for testing. The median of the intervals, measured as the distance between consecutive peaks as shown in

Figure 3.7a, is used as a baseline for the window sizes of the two-window design. The median is used because of the skewness of the distribution of the intervals, as shown in Figure 3.7b.

(a) Definition of Interval (b) Histogram of Interval

Figure 3.7: An illustration of the definition of Interval and a histogram of the interval.

To reasonably determine the final window sizes, we find the optimal ratio R of window size to the median of the peak-to-peak intervals. R is determined by validation. The window sizes are then multiplied by the selected ratio, L′j = Lj × R for j = 1, 2, where L1 and L2 are the window sizes of the smaller and larger windows respectively. Pattern libraries Bn×Lj are built for each window size, where n is the number of time series segments in the library and Lj is the size of the j th window. In our study, the base ratios for the smaller and larger windows are set to 0.5 and 1.5 respectively.

In the validation process, one window ratio is picked each time. R-square is used as the performance measurement because it provides a universal metric that describes how close the prediction is to the real data.

R̂ = arg max_R [ 1 − Σ_{t=1}^{n} (ŷ(t, R) − y(t))² / Σ_{t=1}^{n} (y(t) − ȳ)² ] (3.16)
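A minimal sketch of this validation step in Python; predict_fn is a hypothetical callable that runs the pattern-based predictor on the validation segment for a given ratio R, and the default candidate ratios mirror those used later in the experiments.

```python
import numpy as np

def select_window_ratio(y_true, predict_fn, ratios=(0.75, 1.0, 1.25)):
    """Pick the window ratio R maximizing the validation R-square
    (eq. 3.16). `predict_fn(R)` is assumed to return the predictions
    on the validation segment for a given ratio."""
    ybar = np.mean(y_true)
    best_R, best_r2 = None, -np.inf
    for R in ratios:
        yhat = predict_fn(R)
        r2 = 1 - np.sum((yhat - y_true) ** 2) / np.sum((y_true - ybar) ** 2)
        if r2 > best_r2:
            best_R, best_r2 = R, r2
    return best_R
```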


Figure 3.8: The prediction accuracy for various window lengths, with the window ratio to median interval (R) ranging from 0.3 to 1.5, for prediction horizons h = 1, 5, 10, 15, 20, 25, 30 for patient 16

From Figure 3.8, we find that, for patient 16, if only one window is used for prediction, a shorter window size is better than a longer one. For longer prediction horizons, we observe two local maxima, at R = 0.5 and R = 1.2.

The 3-D plot of the prediction accuracy for various R's shows that the prediction accuracy mostly depends on the prediction horizon, but we can observe that the effect

Figure 3.9: A 3-D plot of the prediction accuracy for various window lengths, with the window ratio to median interval (R) ranging from 0.3 to 1.5, for prediction horizons h = 1, 5, 10, 15, 20, 25, 30 for patient 16

of the window ratio to median interval (R) becomes more significant as the prediction horizon (h) increases.

The window ratio R that maximizes the R-square is selected for prediction. Table 3.3 presents the results of using the adaptive ratio. It shows that, with the adaptive ratio, the prediction is a little more accurate.

However, for now, only 3 ratios - 0.75, 1 and 1.25 - are considered in the experiments, and one ratio is used for all windows each time. In the future, we will study how

to optimize the window ratio for each window in order to obtain the optimal result from the current method.

3.3.2 Variant Best-Neighbors-Based Predictive Pattern Selection

Phase I: Searching for Initial Best Neighbors By Using Right-Aligned Patterns Figure 3.10 illustrates phase I of the VBN approach. In phase I, the

Figure 3.10: A flow chart of the VBN procedure: Phase I

best matched patterns are discovered by searching from the pattern libraries of two

window sizes, L1 and L2 .

We use the current segment to look for patterns whose similarity measures S, defined in equation (3.18), satisfy a similarity threshold θ. The baselines of the candidates must be removed because we are only interested in their patterns. Regular VBN patterns have the problem of not considering signal shifting, which leads to inaccurate prediction due to shift errors, as shown in the left graphs of Figures 3.11 and 3.12. To achieve accurate prediction, we propose to align the patterns at their rightmost point during the VBN searching process. As shown in Figures 3.11 and 3.12, the right alignment puts higher weight on the right end and helps to obtain best neighbors that have a better match at the right end, which we have found to be more important for prediction.

ũn = un − un (end) (3.17)

where un is the segment of the nth candidate in the pattern library and u0 is the current segment. In this study, we introduce a usage of R-square as a similarity metric for two line segments:

Sn = 1 − Σ_{i=1}^{Lw} (ũn (i) − ũ0 (i))² / Σ_{i=1}^{Lw} (ũ0 (i) − ū0 )² (3.18)

where ū0 is the mean of the aligned current segment ũ0 .

Figures 3.11 and 3.12 show close views of typical examples of best neighbors found by comparing raw patterns and right-aligned patterns with the current segment. First of all, from Figures 3.11 and 3.12 we see that using different alignments gives us different best neighbors. Even though the best neighbors found from raw patterns may show high overall similarity, the right-aligned best neighbors show better matching at the right side, which is the point closest to the value to be predicted. From Figures 3.11 and 3.12, we see that right-aligned BNs give better prediction results.

Then, we sort the list of Sn in descending order to begin the acquisition of BNs. We iteratively take BNs from the top of the list until at least k BNs are obtained and the next Sn is smaller than the threshold θ. One important step is

Figure 3.11: Online prediction of a patient's respiratory data using unaligned BNs (left) and right-aligned BNs (right). Below are the best neighbors marked with vertical lines in the time series.

Figure 3.12: A zoom-in view of Figure 3.11, using unaligned BNs (left) and right-aligned BNs (right). We can see that the right-aligned BNs are obviously better than the unaligned BNs.

to remove the candidates which are adjacent to the selected BNs, in order to prevent bias. Our removal strategy is as follows:

Bk = {u ∈ Bk−1 | tu < tλk−1 − m or tu > tλk−1 + m} (3.19)

where Bk denotes the library after accepting the kth best neighbor; tλk denotes the time at the end of the k th best neighbor λk ; and m denotes a small distance within which candidates are excluded. In the respiratory motion prediction study, we choose this distance as one-fifth of the median peak interval, i.e. m = 0.2 × median. A list of BNs, Bλ , is then obtained at the end of Phase I. A minimal sketch of this phase follows.
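The sketch below assumes the pattern library is a list of equal-length NumPy segments and that adjacency can be measured by library index; both are simplifying assumptions for illustration, and the default parameter values are placeholders.

```python
import numpy as np

def phase_one_best_neighbors(library, segment, theta=0.95, k=5, m=10):
    """Phase I sketch: right-align candidates and the current segment
    (eq. 3.17), score them with the R-square similarity (eq. 3.18),
    then accept neighbors from the top of the ranking while excluding
    candidates within m samples of an already accepted neighbor."""
    u0 = segment - segment[-1]                      # right alignment
    denom = np.sum((u0 - u0.mean()) ** 2)
    scores = []
    for idx, cand in enumerate(library):
        un = cand - cand[-1]
        s = 1.0 - np.sum((un - u0) ** 2) / denom    # eq. (3.18)
        scores.append((s, idx))
    scores.sort(reverse=True)                       # best match first
    chosen = []
    for s, idx in scores:
        if len(chosen) >= k and s < theta:
            break
        if all(abs(idx - j) > m for j in chosen):   # adjacency exclusion
            chosen.append(idx)
    return chosen
```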

Phase II: Further Refining The Obtained Best Neighbors By Considering A Larger Window In phase I, by using a short window, the best neighbors are the patterns best matched over a short-term range. That is an important step to attain high accuracy in short-term prediction: the better the match on the most recent data, the more closely the future trend of the best neighbors will be followed. A short window does a better job at this.

The next step is to consider the pattern matching over a longer horizon for the best neighbors obtained from the previous step, in order to guarantee a correct match in the phase of the respiratory motion. In other words, we first look closely to get a clear picture of the short-term trend at the current moment, and then we look at a bigger picture to figure out at what phase of a cycle the respiratory motion is.

The sizes of the shorter and longer windows are two parameters that need to be decided. The optimal sizes can be determined by considering various values in the validation process. In validation, the shorter and longer windows are initially set to 0.5 and 1.5 of the median cycle length; the window sizes are then multiplied by several ratios R, and the best ratio is selected for each patient individually.
In phase II, to search for the best neighbors among the best neighbors obtained in phase I, we repeat the process of phase I, except that this time the window size is changed to the longer window.

After finalizing the set of BNs, prediction is made using the "future" information of the BNs. The simplest effective method is to take the average of their "future" values, as in equation (3.25). For some cases, when the best neighbors do not match very well, we may instead consider support vector regression (SVR); we name this method Right-Aligned Pattern-Based VBN-SVR Prediction (RPKS). The next section discusses this.

Phase III: Best Neighbors Removal Using Statistical Analysis Figure 3.13 shows scatter plots of the errors of short segments before and after tλk for the first eight best neighbors of patient 23. Figure 3.15 is a close view of the best neighbors around tλk ; the blue solid line is the current segment. This example shows that a higher error on the left side implies a higher error on the right side; the correlation is obvious in this example. Figure 3.14 is a drawing that clearly illustrates this phenomenon.

The errors of the mismatch of a short segment of length l just before and just after time t are significantly correlated, where t is the current time point of the current segment and the corresponding time points of the candidates. To further refine the set of candidates, we suggest removing those candidates with a larger mismatch error over a few points just before time t. The sum of squared errors of the match of a short segment of λk with the data D at time t is:

dλk = Σ_{i=−l+1}^{0} (Dtc+i − Dtλk+i )² (3.20)

Figure 3.13: Scatter plots of the error before tλ vs the error after tλ . Correlation
between the errors is observed.

Figure 3.14: An illustration of the error of the best neighbors before and after time tλk . If the error on the left-hand side is large, then the error on the right-hand side is also likely to be large.

Next, since the distributions of the errors are skewed and the skewness varies among individuals, we remove the candidates with error larger than the median plus one and a half times the median absolute deviation (MAD):

B̃ = {λ ∈ B | dλ ≤ dmedian + 1.5 × dMAD } (3.21)
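A minimal sketch of this screening rule; the neighbors and their precomputed mismatch errors dλ are assumed as inputs, and the function name is illustrative.

```python
import numpy as np

def mad_filter(neighbors, errors):
    """Phase III screening (eq. 3.21): keep only neighbors whose recent
    mismatch error d_lambda is at most median + 1.5 * MAD."""
    errors = np.asarray(errors, dtype=float)
    med = np.median(errors)
    mad = np.median(np.abs(errors - med))
    return [nb for nb, e in zip(neighbors, errors) if e <= med + 1.5 * mad]
```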


Figure 3.15: A real example of the error of the best neighbors before and after time
tλk

Although it is rare, sometimes the predicted value of a best neighbor is an outlier among the other candidates, as shown in Figure 3.16. So, the last step of Phase III is to remove those candidates as follows:

B̌ = {λ ∈ B | D(tλ + h) < min( max(D(tλ + h)), Pλ75 + 1.5 × (Pλ75 − Pλ25 ) )} (3.22)

Figure 3.17 shows two examples of best neighbors without any outliers.

3.3.3 Online Prediction Frameworks Using the Selected Predictive Patterns

Prediction Using Average of the Future Values of Reference Patterns For the best neighbors, the expected values h samples ahead are assumed to be similar:

E[y(t + h)] ≈ E[y(tk + h)] for k = 1, . . . , K (3.23)

Figure 3.16: An example of an outlier in the best neighbors

Figure 3.17: Another example of best neighbors without any outliers

where K denotes the number of similar patterns and y(tk + h) denotes the value h samples ahead of the kth referenced pattern. Therefore, the prediction h samples ahead made using the k best neighbors can be written as:

y(t + h) = ε(t + h) + Σ_{k=1}^{K} Θk · y(tk + h) (3.24)

where ε(t + h) denotes the error of predicting the value at time t + h and Θk denotes the coefficient of the referenced value of the k th BN. ε(t + h) includes the random error and the pattern mismatch error.

Taking the average of the referenced values, i.e. the samples h samples ahead of all BNs, for prediction, the proposed model equation can be written as:

ŷ(t + h) = Σ_{k=1}^{K} Θ̂k · y(tk + h) (3.25)

We set Θk = 1/K to use the mean of the future values of the referenced patterns for prediction.

Prediction Using Bootstrapping Average Bootstrap aggregating, also called bagging, is an appropriate way to control and check the stability of the results, and is asymptotically more accurate than the standard intervals obtained using the sample variance and assumptions of normality. By careful choice of the size of the resamples, bagging can lead to substantial improvements in the performance of the kNN method. Adèr et al. recommend the bootstrap procedure for situations where the theoretical distribution of a statistic of interest is complicated or unknown and where the sample size is insufficient for straightforward statistical inference.

In the proposed method, sometimes only a small number of nearest neighbors is obtained when using R-square = 0.95 as the similarity threshold. In this case, the sample size is too small for straightforward statistical inference, and bootstrapping may help to control and check the stability of the results by looking at the bootstrap confidence interval.

The Right-Aligned Pattern-Based Variant-Best-Neighbors Prediction by Bootstrapping Average (RPKM) is defined as:

ŷ(t + h) = (1/M ) Σ_{m=1}^{M} ( (1/N ) Σ_{n=1}^{N} y(tmn + h) ) (3.26)

where M is the number of bootstrap resamples and N is the resample size.

If the referenced values of the nearest neighbors are normally distributed, we may just directly use the simple average and the standard interval: in this case, the bootstrapped average and confidence interval are asymptotically consistent with the simple average and standard interval, so there is no benefit to bootstrapping.

Check for Normality Therefore, before using bootstrapping, we use the Kolmogorov-Smirnov test to check the normality of the referenced values of the nearest neighbors. The null hypothesis states that the population is normally distributed. Figure 3.18 shows the Kolmogorov-Smirnov test results for patient 2 and prediction horizon h = 15: zero means failing to reject the null hypothesis, while one means rejecting it.
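A minimal sketch of this check with scipy; note that standardizing by the sample mean and standard deviation before calling kstest makes the test approximate (the Lilliefors caveat), and the 0.05 level is an illustrative choice.

```python
import numpy as np
from scipy import stats

# `future_values` is assumed to hold the neighbors' values at horizon h
x = np.asarray(future_values, dtype=float)
z = (x - x.mean()) / x.std(ddof=1)   # standardize before testing
stat, p = stats.kstest(z, 'norm')    # H0: the values are normal
use_bootstrap = p < 0.05             # reject normality -> bootstrap
```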


Figure 3.18: Kolmogorov-Smirnov test during prediction of the respiratory motion of


patient 2 with prediction horizon h = 15

Prediction Using Support Vector Regression Best neighbors can only be very similar to, but rarely exactly identical to, the current segment. Support vector regression (SVR) provides a way to bridge this remaining gap between the current segment and the best neighbors. In general, SVR can enhance the prediction slightly compared to simply using the mean value as the predicted value.

SVR obtains a regression function by solving an optimization problem. The advantages of support vector regression are the nonlinearity of the regression function, the ability to handle high-dimensional inputs, and its robustness to outliers. Due to these strengths, SVR provides satisfactory respiratory motion prediction.

Figure 3.19 illustrates a simple example of SVR. The middle line is the regression line, and the upper and lower lines pass through the support vectors. The insensitivity parameter ε adjusts the coarseness of the regression: a smaller ε gives a finer regression line. The slack variables ξ allow outliers to be excluded, and a regularization parameter C controls the cost of introducing slacks.

By choosing a kernel function Φ and using the obtained best neighbors for training, the weights w of the following SVR function can be obtained by optimization algorithms:

y(t + h) = w^T Φ(ut ) + b (3.27)

Prediction is then done by inputting the current segment. The training can be formulated as the following optimization problem [16, 35]:


min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^{L} (ξi + ξi∗ )

s.t. yi+δ − w^T Φ(ui ) − b ≤ ε + ξi (3.28)

w^T Φ(ui ) + b − yi+δ ≤ ε + ξi∗

ξi , ξi∗ ≥ 0, i = 1, . . . , L.
Note that the problem satisfies the KKT conditions. We can introduce Lagrange multipliers α, α∗ , η, η ∗ ≥ 0 and rewrite the problem as follows:

L = (1/2)‖w‖² + C Σ_{i=1}^{L} (ξi + ξi∗ ) − Σ_{i=1}^{L} (ηi ξi + ηi∗ ξi∗ ) − Σ_{i=1}^{L} αi (ε + ξi − yi+δ + w^T Φ(ui ) + b) − Σ_{i=1}^{L} αi∗ (ε + ξi∗ + yi+δ − w^T Φ(ui ) − b) (3.29)

From the saddle point condition, the partial derivatives of L with respect to the primal variables (w, b, ξi , ξi∗ ) have to vanish at optimality:

∂b L = Σ_{i=1}^{l} (αi∗ − αi ) = 0

∂w L = w − Σ_{i=1}^{l} (αi − αi∗ )Φ(ui ) = 0 (3.30)

∂ξi(∗) L = C − αi(∗) − ηi(∗) = 0

Substituting equations (3.30) into equation (3.29) yields the dual optimization problem:

maximize −(1/2) Σ_{i,j=1}^{l} (αi − αi∗ )(αj − αj∗ )⟨xi , xj ⟩ − ε Σ_{i=1}^{l} (αi + αi∗ ) + Σ_{i=1}^{l} yi (αi − αi∗ ) (3.31)

subject to Σ_{i=1}^{l} (αi − αi∗ ) = 0 and αi , αi∗ ∈ [0, C]

By solving equation (3.31), we obtain the regression function (equation 3.27), and we input the current segment and the best neighbors into that function to make the prediction. The proposed Right-Aligned Pattern-VBN-Based SVR prediction is denoted RPKS in the rest of this chapter; a minimal sketch follows.
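This sketch trains an ε-SVR on the best neighbors with scikit-learn; the hyperparameter values shown are placeholders for the grid search described in the experimental settings, and feeding whole segments as feature vectors is a simplifying assumption.

```python
import numpy as np
from sklearn.svm import SVR

def rpks_predict(neighbor_segments, neighbor_futures, current_segment,
                 C=1.0, epsilon=0.01, gamma='scale'):
    """RPKS-style step: fit an epsilon-SVR mapping each best-neighbor
    segment to its value h samples ahead, then evaluate it on the
    current segment."""
    X = np.asarray(neighbor_segments)   # one row per best neighbor
    y = np.asarray(neighbor_futures)    # their values at t_k + h
    model = SVR(kernel='rbf', C=C, epsilon=epsilon, gamma=gamma)
    model.fit(X, y)
    return model.predict(current_segment.reshape(1, -1))[0]
```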

Figure 3.19: Illustration of support vector regression with insensitivity parameter ε and slack variable ξ.

3.3.4 Comparison of the Prediction Performance of RPKM and Some State-Of-The-Art Methods

The following compares the prediction performance of RPKM with the latest state-of-the-art methods. Wavelet-based Multiscale Autoregression (wLMS) and Support Vector Regression prediction (SVRpred) were identified as the best methods by the survey conducted by Ernst [15, 30, 16, 24] in 2013. TVSAR is a method developed by Ichiji [27], also published in 2013, whose authors likewise report state-of-the-art performance. In addition, Seasonal ARIMA is added to the comparison, as most readers are familiar with this method.

Data Acquisition and Experimental Settings Time series of the abdominal displacement of 27 lung and liver cancer patients were collected with the Real-time Position Management™ (RPM) (Varian Inc., Santa Clara, CA) infrared camera and reflective marker block system during their PET/CT examinations. The time series serves as a respiratory motion surrogate [38].

The use of the data was approved by the appropriate Institutional Review Board in compliance with the Health Information Privacy and Portability Act [38].

The sampling rate of the respiratory traces was 30 Hz. The duration of data collection is from 15 to 45 minutes. The respiratory motion traces of the 27 patients demonstrate very high individuality.

Of the data, 60% is used for training, 20% for validation, and the remainder for testing.

TVSAR and wLMS do not need training; their prediction starts directly on the testing set.

For the experiments with RPKM and RPKS, the threshold for obtaining the best neighbors is set to 0.95.

For RPKS and SVRpred, we consider 2^{−12} , 2^{−11} , . . . , 2^{12} for the kernel parameter γ; 0, 0.01, 0.02, . . . , 0.1 for the insensitive zone ε; and max(|ȳ + 3σy |, |ȳ − 3σy |) for the regularization parameter C.

Prediction Performance of RPKM, RPKS and the Latest State-of-the-Art Methods  Table 3.2 shows the prediction performance of RPKM, RPKS, res TVSAR, wLMS, SVRpred and SARIMA, and Figure 3.20 shows box plots of the prediction performance of the proposed methods and the current state-of-the-art methods. Even though we consider only three ratios and the ratio is fixed for both windows, a small improvement from using adaptive windows is still visible; we expect that optimizing the window size for each individual patient would further improve performance.

Among the state-of-the-art methods, wLMS performs very well in short-term prediction, while res TVSAR outperforms wLMS for long-term prediction. Except for SVRpred, all other methods perform better than SARIMA.

Finally, it is evident that RPKM and RPKS significantly outperform all other methods, and the results show that RPKS is slightly better than RPKM.

Table 3.2: The prediction performance metrics (mean and standard deviation of R-squares) of the proposed methods and the state-of-the-art respiratory motion prediction methods on 27 patients
Prediction horizon 1 5 10 15 20 25 30
RPKM mean 0.998 0.976 0.918 0.831 0.728 0.620 0.523
std 0.001 0.018 0.052 0.095 0.141 0.179 0.206
RPKS mean 0.998 0.978 0.920 0.836 0.732 0.624 0.523
std 0.002 0.018 0.053 0.093 0.132 0.167 0.196
res TVSAR mean 0.964 0.834 0.684 0.462 0.229 0.013 -0.146
std 0.088 0.378 0.393 0.436 0.487 0.462 0.454
wLMS mean 0.996 0.880 0.648 0.386 0.131 -0.083 -0.233
std 0.005 0.322 0.487 0.527 0.535 0.526 0.520
SVRpred mean 0.908 0.738 0.639 0.347 0.029 -0.154 -0.323
std 0.044 0.075 0.099 0.164 0.324 0.323 0.359
SARIMA mean 0.979 0.846 0.608 0.231 -0.053 -0.292 -0.414
std 0.019 0.127 0.281 0.469 0.466 0.475 0.479

Figure 3.20: Prediction performance of RPKM, RPKS and the state-of-the-art methods for prediction horizons h = 1 to h = 30. Panels (a)-(g) show box plots for h = 1 (close view), 5, 10, 15, 20, 25 and 30.

Prediction Performance of RPKM With and Without Adaptive Ratio  Table 3.3 shows the prediction performance of RPKM with and without the adaptive ratio, and Figure 3.21 shows the corresponding box plots. Based on the experimental results, the adaptive window enhances the prediction accuracy of RPKM.

Table 3.3: The prediction performance metrics, mean and standard deviation of R-
squares, of the proposed approaches with and without adaptive ratio on 27 patients
Prediction horizon 1 5 10 15 20 25 30
RPKM mean 0.998 0.976 0.918 0.831 0.728 0.620 0.523
std 0.001 0.018 0.052 0.095 0.141 0.179 0.206
RPKM(without adaptive ratio) mean 0.998 0.976 0.916 0.827 0.721 0.612 0.517
std 0.001 0.019 0.054 0.096 0.142 0.180 0.206

3.4 Pattern-Based Variant-Best-Neighbors Prediction Using Orthogonal-Polynomial-Approximated Respiratory Motion Time Series

Directly using raw data for pattern matching works as long as the signal is clean, with little noise. However, the quality of medical devices varies from one device to another, and some systems produce more noise than others. It is therefore desirable to find a robust method that can cope with noisier data and deliver consistent performance.

Besides, using raw data to build the pattern libraries consumes a lot of space: the higher the sampling rate, the finer the signal, but also the larger the library. Sparseness is a popular topic in data mining; using a reduced representation of the original signal usually speeds up computation and makes the system more scalable.

The method that uses orthogonal polynomial approximation for the pattern-based variant-best-neighbors time series prediction is named OPPRED. It follows the same structure as RPKM, as shown in Figure 3.22, except that the data are converted into OP approximations.
Figure 3.21: Prediction performance of RPKM and RPKM (without adaptive ratio) for prediction horizons h = 1 to h = 30. Panels (a)-(g) show box plots for h = 1, 5, 10, 15, 20, 25 and 30.

Figure 3.22: The general approach of the proposed pattern-based variant-best-neighbors prediction using orthogonal-polynomial-approximated respiratory motion time series

3.4.1 Orthogonal Polynomials Approximation

Fuchs et al. [39] proposed a method for online segmentation of time series based on least-squares approximations with Legendre orthogonal polynomials (OPs). Although the method was originally intended for time series segmentation, it exhibits properties that are well suited to time series pattern matching.

A time series consisting of real-valued samples y_t, t = 0, \dots, N, with sampling rate s can be modeled by a parameterized function f(x): \mathbb{R} \to \mathbb{R}. Here, we assume

that f (x) is linearly dependent on a parameter vector w with elements wk ∈ R(k =

0, . . . , K). Note that we do not claim that f (x) is a linear function in x. More
concretely, we assume that f is a linear combination of K + 1 (linear or nonlinear)

so-called basis functions fk :


f(x) = \sum_{k=0}^{K} w_k \cdot f_k(x)          (3.32)

These basis functions may be polynomials, wavelets, sigmoid functions, or si-

nusoidal functions, for instance.

We may write the values of the K + 1 basis functions for the N + 1 points in time x_0, \dots, x_N into a matrix

F = \begin{pmatrix} f_0(x_0) & \cdots & f_K(x_0) \\ \vdots & \ddots & \vdots \\ f_0(x_N) & \cdots & f_K(x_N) \end{pmatrix}          (3.33)

If we combine the N + 1 samples of the overall time series into a vector y with

elements y_n, the linear least-squares problem we want to solve can be denoted by

\min_{w} \| F w - y \|          (3.34)

with \|\cdot\| being the Euclidean norm. Its solution w_{LS} can be found by setting the derivative with respect to w to zero. First,

\|Fw - y\|^2 = \langle Fw - y \,|\, Fw - y \rangle = w^T F^T F w - 2 y^T F w + y^T y          (3.35)

with \langle\cdot|\cdot\rangle being the standard inner product in a real-valued vector space. Then,

\frac{\partial \|Fw - y\|^2}{\partial w} = 2 F^T F w - 2 F^T y          (3.36)

leads to the least-squares solution

w_{LS} = (F^T F)^{-1} F^T y          (3.37)

provided that the matrix F^T F is regular. Using the pseudo-inverse F^+ = (F^T F)^{-1} F^T of F, we can then write

w_{LS} = F^+ y          (3.38)

In general, a real-valued pseudo-inverse A+ of a matrix A has the following

two properties (two of the four so-called Penrose criteria). First, (AA+ )T = AA+ .

Second, AA+ A = A and, consequently, AA+ AA+ = AA+ . Thus, the residuum

resulting from this least-squares approximation is

r_{LS} = \|F w_{LS} - y\|^2
       = y^T (F F^+)^T F F^+ y - 2 y^T (F F^+) y + y^T y          (3.39)
       = y^T y - w_{LS}^T F^T F w_{LS}

where F^T y = (F^T F) F^+ y = (F^T F) w_{LS} and (F F^+)^T F F^+ = F F^+. With the term (average) squared error, we refer to the residuum divided by the number of observed samples:
\sigma_{LS}^2 = \frac{1}{N+1} \left( y^T y - w_{LS}^T F^T F w_{LS} \right)          (3.40)
With (3.40), we can determine the squared error once the least-squares solution for w has been obtained.

In general, the solution of a linear least-squares problem is found by conducting

a QR decomposition or a singular value decomposition (SVD) of F.

Now, assume that the selected K + 1 basis functions are orthogonal with respect to an inner product, i.e., \sum_{n=0}^{N} f_{k_1}(x_n) \cdot f_{k_2}(x_n) = 0 for any two basis functions f_{k_1} and f_{k_2} with k_1 \ne k_2. This is the case for special kinds of polynomials

(see Section 3.2), for wavelet families, or the sinusoidal functions used for discrete

Fourier transforms, for instance. Then,

F^T F = \begin{pmatrix} f_0(x_0) & \cdots & f_0(x_N) \\ \vdots & \ddots & \vdots \\ f_K(x_0) & \cdots & f_K(x_N) \end{pmatrix} \begin{pmatrix} f_0(x_0) & \cdots & f_K(x_0) \\ \vdots & \ddots & \vdots \\ f_0(x_N) & \cdots & f_K(x_N) \end{pmatrix} = \begin{pmatrix} \|f_0\|^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \|f_K\|^2 \end{pmatrix}          (3.41)

That is, F^T F is a diagonal matrix, which can be inverted if the elements on the diagonal, i.e., the squared norms of the basis functions, are nonzero. This can easily be guaranteed by an appropriate choice of basis functions. From (3.37), we then get

w_{LS} = (F^T F)^{-1} F^T y
       = \begin{pmatrix} \frac{f_0(x_0)}{\|f_0\|^2} & \cdots & \frac{f_0(x_N)}{\|f_0\|^2} \\ \vdots & \ddots & \vdots \\ \frac{f_K(x_0)}{\|f_K\|^2} & \cdots & \frac{f_K(x_N)}{\|f_K\|^2} \end{pmatrix} \begin{pmatrix} y_0 \\ \vdots \\ y_N \end{pmatrix}          (3.42)
       = \begin{pmatrix} \sum_{n=0}^{N} \frac{y_n}{\|f_0\|^2} f_0(x_n) \\ \vdots \\ \sum_{n=0}^{N} \frac{y_n}{\|f_K\|^2} f_K(x_n) \end{pmatrix}

That is, the least-squares solution can be written as a linear combination of the

training samples (cf. the dual representations of classifiers which are common in the

field of support vector machines, for instance).

With this result for w_{LS}, with Equation (3.40), and with the definition w_k = \sum_{n=0}^{N} \frac{y_n}{\|f_k\|^2} f_k(x_n) for k = 0, \dots, K (elements of the solution vector w_{LS}), the squared error \sigma_{LS}^2 now becomes

\sigma_{LS}^2 = \frac{1}{N+1} \left( \sum_{n=0}^{N} y_n^2 - \sum_{k=0}^{K} w_k^2 \|f_k\|^2 \right)          (3.43)

Assume that, in a time window of length L + 1, the values y_0, y_1, \dots, y_L, measured at equidistant points in time x_0, x_1, \dots, x_L, must be approximated by a polynomial p of degree K \le L (L \in \mathbb{N}_0, K \in \mathbb{N}_0) in the least-squares sense.

It is well known that orthogonal polynomials with leading coefficient 1 in the

vector space P(R, R) of real polynomials on R fulfill the following three-term recur-

rence relation:

p_{-1}(x) = 0,          (3.44)
p_0(x) = 1,          (3.45)
p_{k+1}(x) = (x - a_k)\, p_k(x) - b_k\, p_{k-1}(x)          (3.46)

For the sliding window method, the approximation window [x_0, x_1, \dots, x_L] is located at [0, 1, \dots, L]. In our study, we use Legendre orthogonal polynomials, as shown in Figure 3.23, which fulfill the three-term recurrence relation with

a_k = \frac{L}{2},          (3.47)
b_k = \frac{k^2 \left( (L+1)^2 - k^2 \right)}{4 (4k^2 - 1)}          (3.48)
This provides a fast update procedure for generating the pattern libraries.
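The following is a minimal sketch of this procedure in NumPy, assuming a monic discrete-Legendre (Gram) basis on the grid x = 0, 1, ..., L generated by the recurrence (3.44)-(3.48), with the least-squares coefficients following the closed form (3.42). The function names are illustrative, not the implementation used in this study.

    import numpy as np

    def gram_basis(L, K):
        """Basis matrix F with entries f_k(x_n) via the three-term recurrence."""
        x = np.arange(L + 1, dtype=float)
        F = np.zeros((L + 1, K + 1))
        F[:, 0] = 1.0                              # p_0(x) = 1
        if K >= 1:
            F[:, 1] = x - L / 2.0                  # p_1(x) = (x - a_0) p_0(x)
        for k in range(1, K):
            b_k = k**2 * ((L + 1)**2 - k**2) / (4.0 * (4 * k**2 - 1))
            F[:, k + 1] = (x - L / 2.0) * F[:, k] - b_k * F[:, k - 1]
        return F

    def op_coefficients(y, K):
        """Least-squares coefficients w_k = sum_n y_n f_k(x_n) / ||f_k||^2, eq. (3.42)."""
        F = gram_basis(len(y) - 1, K)
        return (F.T @ y) / np.sum(F * F, axis=0)

    y = np.sin(np.linspace(0, 2 * np.pi, 180))     # one synthetic breathing cycle
    w = op_coefficients(y, K=20)                   # 21 coefficients
    y_hat = gram_basis(len(y) - 1, 20) @ w         # reconstruction of the segment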

Due to the orthogonality of the OPs, their coefficients are independent of each other. Referring to equation (3.42), the coefficient of each OP does not change as the approximation order increases; conversely, when the order decreases, the coefficients of the OPs whose order exceeds the new maximum simply become zero.

Figure 3.23: Legendre Polynomials

In other words, only one approximation has to be computed to obtain all approximations of equal or lower order. For instance, once we obtain the approximation of order 20, we also obtain the approximations of order 19, 18, and so on.

This property is important to our application of time series pattern approximation because a higher approximation order does not necessarily give the best approximation. One example is shown in Figure 3.24, and its coefficients are listed in Table 3.4.

Table 3.4: The coefficients of orthogonal polynomials up to order 20


w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10
0.267353 -4.71E-05 8.09E-06 5.60E-08 5.81E-10 6.94E-13 -2.07E-14 -7.45E-16 -8.39E-18 -9.75E-20 -3.60E-22
w11 w12 w13 w14 w15 w16 w17 w18 w19 w20
2.59E-24 1.51E-25 1.77E-27 1.28E-29 -5.73E-32 -1.90E-33 -2.98E-35 -2.14E-37 -1.33E-39 -5.04E-41

Figure 3.24: An example of OP approximation in which the lower-order approximation (order 18) is better than the higher-order one (order 20)

The Advantages of Using Orthogonal Polynomials Approximations  The advantages of using orthogonal polynomial approximations can be summarized as follows:

Efficiently determining the best order of OP  A higher order does not necessarily give a better approximation. With orthogonal polynomial approximation, performing a single approximation provides all approximations of lower orders, which makes determining the best order efficient. In contrast, a traditional approximation must be recomputed for every candidate order.

Sparse data representation  The sampling rate of our data is 30 Hz and a typical respiratory cycle takes about 6 seconds, so there can be about 180 samples in one cycle. Using an orthogonal polynomial approximation of order 20, only 21 coefficients need to be recorded.

Accurate approximation  For about 1 to 2 cycles of respiratory motion, an order-20 approximation usually achieves a very good fit, with R-squares typically higher than 0.99.

Fast updating and reconstruction  When the window length N is fixed, the orthogonal polynomials are fixed and can easily be calculated using the three-term recurrence relation with coefficients (3.47)-(3.48). Due to the orthogonality of the OPs, we also obtain a closed-form equation for the coefficients, shown in equation (3.42).

Signal smoothing  The least-squares error is used for approximation. Since the model usually cannot fit the time series perfectly, trade-offs are made during approximation, and outliers tend to be smoothed out.

Readiness for clustering and classification using coefficients  The resulting coefficients represent the weights of the orthogonal polynomials in the approximation; they can potentially be used as features for clustering and classification.

Distance Measure of Patterns Using the Coefficients of Orthogonal Poly-

nomials Approximation In this study, we still use R-squares as the similarity

metric of two patterns.

S_n = 1 - \frac{SSE}{SS_{tot}}          (3.49)

while the sum of squared errors (SSE) is

\sum_{i=1}^{n} |e_i|^2 = \sum_{i=1}^{n} \left| \sum_{k=1}^{K} \Delta w_k \left( f_k(i) - f_k(n) \right) \right|^2          (3.50)

By expansion, the SSE can be written in closed form:

SSE(w) = c_{11} \Delta w_1^2 + c_{12} \Delta w_1 \Delta w_2 + \cdots + c_{KK} \Delta w_K^2          (3.51)

where the c_{jk}, j, k \in \{1, \dots, K\}, are constants when the orthogonal polynomials f_k and n are fixed. Therefore, the similarity metric S_n(w) depends only on the orthogonal polynomial coefficients w:

S_n(w) = 1 - \frac{SSE(w)}{SS_{tot}}          (3.52)
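As a minimal sketch (under the assumption that both patterns were approximated on the same window with a basis matrix F, such as the one produced by the gram_basis helper sketched earlier, all hypothetical names), the similarity can be evaluated directly from the coefficient difference without touching the raw samples:

    import numpy as np

    def coefficient_similarity(w_a, w_b, F, y_ref):
        """R-squares similarity (3.52) computed from OP coefficients alone."""
        delta_w = w_a - w_b
        e = F @ delta_w                      # pointwise difference of the two fits
        sse = float(e @ e)                   # the quadratic form of (3.51)
        ss_tot = float(np.sum((y_ref - y_ref.mean()) ** 2))
        return 1.0 - sse / ss_tot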
3.4.2 Prediction Results of RPKM and OPPRED

Table 3.5 shows that the performances of RPKM and OPPRED are very close to each other, with RPKM just slightly better than OPPRED.

Table 3.5: The prediction performance of RPKM and OPPRED on 27 patients.


Prediction horizon 1 5 10 15 20 25 30
RPKM mean 0.998 0.976 0.918 0.831 0.728 0.620 0.523
std 0.001 0.018 0.052 0.095 0.141 0.179 0.206
OPPRED mean 0.998 0.975 0.914 0.824 0.720 0.612 0.516
std 0.001 0.019 0.056 0.100 0.143 0.179 0.204
res TVSAR mean 0.964 0.834 0.684 0.462 0.229 0.013 -0.146
std 0.088 0.378 0.393 0.436 0.487 0.462 0.454
wLMS mean 0.996 0.880 0.648 0.386 0.131 -0.083 -0.233
std 0.005 0.322 0.487 0.527 0.535 0.526 0.520
SVRpred mean 0.908 0.738 0.639 0.347 0.029 -0.154 -0.323
std 0.044 0.075 0.099 0.164 0.324 0.323 0.359
SARIMA mean 0.979 0.846 0.608 0.231 -0.053 -0.292 -0.414
std 0.019 0.127 0.281 0.469 0.466 0.475 0.479

As shown in Figures 3.25b to 3.25h, RPKM and OPPRED significantly outperform all other methods at all prediction horizons. With the exception of SVRpred, seasonal ARIMA performs worse than the other methods, which are dedicated to the respiratory motion time series prediction problem.

Figure 3.25: Prediction performance of RPKM, OPPRED, res TVSAR, wLMS, SVRpred and SARIMA. Panels (a)-(h) show box plots for prediction horizons h = 1 (with a close view), 5, 10, 15, 20, 25 and 30.

Two Examples of Prediction of the Proposed Methods  Figures 3.26 and 3.27 show two prediction examples that visually demonstrate how well the methods perform. The solid blue line represents the observations, the red line OPPRED, and the black dotted line RPKM.

Figure 3.26: Prediction results of Patient 9 with h = 15

Figure 3.27: Prediction results of Patient 2 with h = 15

Weighted-Pattern-Based Variant-Best-Neighbors Prediction Using Weighted Orthogonal-Polynomial-Approximated Respiratory Motion Time Series

The relative importance of different parts of a time series segment varies. Figure 3.28 shows an example: without weights, the two approximation errors appear to be the same, so we propose weighted orthogonal polynomial pattern matching to distinguish them.

On the local scale of respiratory motion time series, the latest data are more important than the older data; recall, for example, the correlation of errors before and after time t_\lambda of the best neighbors. Referring back to Figure 3.15, the error close to and before the reference time is correlated with the error after the reference time. Since we desire this kind of flexibility in our respiratory motion prediction problem, we introduce weights on the errors of the time series.

To improve our algorithm, we propose adding weights to the errors of the polynomial approximation, the pattern matching, or both.

Figure 3.28: This example shows that even though two time series have the same total error, the locations of the errors can be very different. The upper plot shows two patterns that match very well in the older data (left) but not in the newest data; for prediction, we would therefore prefer the lower one.

Many distance functions have been developed throughout the history of time series research. The L_p-norm, a common family of distance measures, has the following definition.

Definition 3.1 (L_p-norm): Given two time series R and S of the same length N, the L_p-norm distance between R and S is

L_p\text{-norm}(R, S) = \sqrt[p]{\sum_{i=1}^{N} (r_i - s_i)^p}          (3.53)

Even R-squares can be seen as a variant of the Euclidean distance (p = 2) when comparing against a fixed reference time series, which in our method is the current pattern.
Thus, our similarity metric, equation (3.52), can be generalized by adding weights, as below.
Definition 3.2: The weighted L_p-norm is defined as

L_p\text{-norm}(R, S, W) = \sqrt[p]{\sum_{i=1}^{N} w_i (r_i - s_i)^p}          (3.54)

where w_i is the weight for the distance of the pair of the i-th samples.

3.4.3 Weighted Orthogonal Polynomials Approximations

The conventional orthogonal polynomial approximation considers all points of a time series to be equally important, and the regression approximates the whole series with minimum overall error. However, flexibility can be introduced into the orthogonal polynomial approximation. In our study, we desire a more accurate approximation of the most recent data than of the older data. To achieve this, weights b are applied to the approximation error during regression. Equation (3.35) then becomes:

(Fw - y)^T b (Fw - y) = (Fw - y)^T (bFw - by) = w^T F^T b F w - 2 y^T b F w + y^T b y          (3.55)

Then,

\frac{\partial (Fw - y)^T b (Fw - y)}{\partial w} = 2 F^T b F w - 2 F^T b y          (3.56)

Setting 2 F^T b F w - 2 F^T b y = 0, we have

2 F^T b F w = 2 F^T b y          (3.57)

The coefficients w_{LS} can then be written as:

w_{LS} = (F^T b F)^{-1} F^T b y = \begin{pmatrix} \sum_{n=0}^{N} \frac{y_n b_n}{\|f_0\|^2} f_0(x_n) \\ \vdots \\ \sum_{n=0}^{N} \frac{y_n b_n}{\|f_K\|^2} f_K(x_n) \end{pmatrix}          (3.58)

Under the current framework, the best neighbors are found by a two-step pattern search. An alternative is to assign weights to the two windows and combine their similarity values with those weights. This variant can give results very similar to the proposed multi-step pattern search, and it is more mathematically integrated.

It is desirable to keep most of the data accurately approximated and to relax only the oldest data. Figure 3.29 illustrates the weights considered in our study. During validation, we can try multiple sets of weights and select the one giving the best performance.
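A minimal sketch of the weighted fit follows, assuming the basis matrix F from the earlier gram_basis sketch: since a general weight profile breaks the orthogonality of the basis, the sketch solves the weighted normal equations behind (3.58) directly, and the ramp-shaped weight is only one illustrative choice of the profile in Figure 3.29.

    import numpy as np

    def weighted_op_coefficients(y, F, b):
        """Solve (F^T b F) w = F^T b y for the weighted coefficients."""
        B = np.diag(b)
        return np.linalg.solve(F.T @ B @ F, F.T @ B @ y)

    L = 99                                            # window of L + 1 samples
    b = np.concatenate([np.linspace(0.2, 1.0, 30),    # relax only the oldest data
                        np.ones(L + 1 - 30)])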

3.4.4 Weighted Time Series Pattern Matching

During pattern matching of time series, we may want to emphasize the importance of some parts of the series. Therefore, we propose a weighted time series pattern matching. Similar work has been done by Jeong [40], who proposed weighted dynamic time warping (WDTW); dynamic time warping (DTW) is a distance measure for time series. Similarly, for weighted time series pattern matching, weights are added to the time series to penalize dissimilarity in different parts of the pattern, achieving more flexibility in pattern matching.
Figure 3.29: The weights of the shorter window (black dotted) and the longer window (red dotted)

In respiratory motion time series prediction, the latest data are intuitively more important than the older data.

To implement weights in pattern matching, the computation of the similarity of time series segments, i.e., equation (3.49), is modified as

S_n = 1 - \frac{\sum_{i=1}^{L_w} \left( (u_n(i) - u_0(i))\, w_i \right)^2}{\sum_{i=1}^{L_w} \left( (u_0(i) - \bar{u}_0)\, w_i \right)^2}          (3.59)
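A minimal sketch of (3.59), where u0 is the current pattern, un a candidate neighbor, and w a weight profile of the same length (all hypothetical names):

    import numpy as np

    def weighted_similarity(u0, un, w):
        """Weighted R-squares similarity of two equal-length segments, eq. (3.59)."""
        sse = np.sum(((un - u0) * w) ** 2)
        ss_tot = np.sum(((u0 - u0.mean()) * w) ** 2)
        return 1.0 - sse / ss_tot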

A Simulation Study on Noise-Added Time Series Data  A simulation study is conducted to validate whether OPPRED is robust to noisy data. In this study, artificially generated noise similar to that in Figure 3.31 is added to the real respiratory motion data of the first 4 of the 27 patients. One setting generates short, sporadic noise, while the other generates relatively longer and sparser noise, as shown in Figure 3.31.

Table 3.6 shows the mean and standard deviation of the prediction performance (R-squares) of RPKM and OPPRED on the noise-added time series data, with prediction horizons ranging from 1 to 30. It shows that when the time series is noisy, OPPRED performs slightly better than RPKM. Together with its sparse data representation, this makes OPPRED a suitable algorithm for respiratory motion time series prediction.

Table 3.6: The prediction performance metrics, mean and standard deviation of R-
squares, of the proposed approaches on first 4 patients noise-added respiratory motion
time series.
Prediction horizon 1 5 10 15 20 25 30
RPKM(noise) mean 0.975 0.946 0.876 0.783 0.674 0.565 0.479
std 0.010 0.025 0.062 0.104 0.154 0.191 0.216
OPPRED(noise) mean 0.973 0.947 0.879 0.791 0.686 0.578 0.487
std 0.011 0.026 0.061 0.101 0.146 0.180 0.206

3.5 Discussion and Conclusion

In this study, we developed a pattern-matching-based semi-periodic time series prediction framework and applied it to respiratory motion time series prediction. In radiotherapy, system latencies need to be compensated for accurate irradiation during treatment; accurate respiratory motion prediction can minimize damage to normal body tissues and vital organs.

Pattern matching can effectively utilize the existing information in the data: similar patterns demonstrate similar trends in the response. The pattern recognition process is enhanced by combining it with statistical and feature analysis, which help obtain better-matched patterns and remove undesired ones.
Figure 3.30: Prediction performance of RPKM and OPPRED on noise-added data. Panels (a)-(g) show box plots for prediction horizons h = 1, 5, 10, 15, 20, 25 and 30.

Figure 3.31: An example of a noise-added time series in the simulation study. Simulated noise is added to the respiratory time series data of a patient.

The experimental results show that the prediction of the proposed method is very accurate and robust across different kinds of patients. For h = 5, most patients attain an R-squares of 0.95. This method should contribute greatly to tumor position prediction and thereby help cancer patients improve their quality of life.

Looking at the autocorrelation of height and interval, we know that respiratory motion time series are autoregressive, so the values in the next cycle may be determined by previous cycles. This theoretically supports our method, which uses similar patterns to predict future values.

The simulation study on noise-added data shows that OPPRED is more robust to noise and drifting than RPKM. Since the proposed method is designed to predict respiratory motion time series, which demonstrate tremendous individuality, it also shows potential for other applications that exhibit the characteristics of semi-periodic time series.

3.5.1 Future Studies

In the future, improvements are possible in the following aspects.

Finding Better Pattern Matching Methods  The current two-window design is intended to consider both short-range and long-range patterns of the time series. In the future, we will conduct more experiments on weighted pattern matching and weighted orthogonal approximation, which show great potential to improve our methods and are discussed in detail in Sections 3.4.3 and 3.4.4.

Finding a Better Distance Measure  Currently, R-squares is used to measure the similarity of time series. It provides a convenient control mechanism for time series similarity by providing a universal measurement. However, R-squares is sensitive to the window length and is not perfect; in the future, we may develop a better similarity measure.

Generalizing the current 2-window design to an n-window design  The 2-window design can in fact be seen as a special case of an n-window design with zero weights on the other windows. Theoretically, the windows can be infinitely long; however, the patterns found by longer windows are obviously less important than those found by shorter ones. For instance, in respiratory motion, we already know that half-cycle patterns are more predictive than one-cycle patterns.

Optimizing Parameters  Intuitively, not every part of a time series segment is equally important for obtaining the most effective approximations or the best neighbors of time series patterns.

Generally, our method can be improved in four ways: the first is to improve the pattern matching process; the second is to improve how the obtained best-matching patterns are used for prediction; the third is to optimize the weights in pattern matching and orthogonal approximation; and the fourth is to optimize the remaining parameters of the algorithm, such as the sizes of the two windows and the parameters of the SVR.

CHAPTER 4

Pattern Recognition and Classification of Multivariate Time Series Signals: EEG

Study of Musicians and Non-Musicians

4.1 Introduction

There has been much interest in the beneficial effects of musical training on

cognition. Previous studies have indicated that musical training was related to better

working memory and that these behavioral differences were associated with differ-

ences in neural activity in the brain. However, it was not clear whether musical

training impacts memory in general, beyond working memory. By recruiting pro-

fessional musicians with extensive training, we investigated if musical training has

a broad impact on memory with corresponding electroencephalography (EEG) sig-

nal changes, by using working memory and long-term memory tasks with verbal and

pictorial items. Behaviorally, musicians outperformed non-musicians on both working memory and long-term memory tasks. A comprehensive EEG pattern study has been performed,

including various univariate and multivariate features, time-frequency (wavelet) anal-

ysis, power-spectra analysis, and deterministic chaotic theory. The advanced feature

selection approaches have also been employed to select the most discriminative EEG

and brain activation features between musicians and non-musicians. High classifica-

tion accuracy (more than 95%) in memory judgments was achieved using Proximal

Support Vector Machine (PSVM). For working memory, it showed significant differ-

ences between musicians versus non-musicians during the delay period. For long-term

memory, significant differences on EEG patterns between groups were found both in

the pre-stimulus period and the post-stimulus period on recognition. These results

indicate that musicians' memory advantage occurs in both working memory and long-term memory, and that the developed computational framework using advanced data mining techniques can be successfully applied to classify complex human cognition with high temporal resolution.

4.2 Methodology

4.2.1 Data Acquisition and Experimental Settings

Participants  36 participants were initially recruited for the study. Four participants were excluded for having negative d′ values on the long-term memory task, two were excluded for failing to follow directions, and one was excluded as a behavioral outlier (more than 3 SDs from the mean long-term memory performance). In total, 29

subjects were included in the analyses; 14 (5 female) were professional musicians with

>10 years of experience (M = 22.9 years of experience) and 15 were nonmusicians

(8 female) with no musical training. Informed consent was obtained from all partic-

ipants in accordance with the experimental protocol approved by the University of

Texas Institutional Review Board.

Design of the Experiments Participants completed a study session followed by

a test session involving words and pictures as stimuli. Stimuli were presented visually

on a computer and all responses were made using the keyboard. During the study

session, participants were presented with pairs of stimuli, one at a time. Each study

trial began with a fixation cross (250 ms), the first stimulus (1000 ms), a blank screen

(5000 ms), the second stimulus (2500 ms or until a response), and finally a blank

screen (1000 ms). Upon presentation of the second stimulus, participants made a

judgment of whether the second stimulus was the same as the first (Figure 4.1a).

A few minutes following the study session, participants' memory was tested. During this test session, stimuli presented during study were presented again along with new stimuli that had not been studied. Further, we only tested participants' memory on stimuli that had been presented once. Therefore, only stimuli presented on trials that were different during the study session (i.e., trials on which the second stimulus was different from the first) were presented during test. Each test trial began with a fixation (250 ms), followed by a stimulus (3000 ms or until a response), and then a blank screen (1250 ms). Upon presentation of the stimulus, participants made a memory judgment which included a rating of how confident they were in their memory (Figure 4.1b). They were allowed to make three responses: remember with low confidence, remember with high confidence, or new.

Word and picture stimuli were blocked for both study and test phases, such

that each participant was presented with a block of word trials followed by a block of

picture trials (or vice versa). Whether or not participants were presented with words

or pictures first was randomly determined for each participant.

Types of Stimuli Participants were presented with pictures of complex scenes and

words. During the study session, participants completed 96 trials of pictures (32

same, 64 different) and 96 trials of words (32 same, 64 different). Given that each

trial contained two stimulus presentations, participants studied a total of 128 pictures

and 128 words from different trials. These stimuli were used to test long-term memory

during the test session. During the long-term memory task, participants completed

192 trials of pictures (128 studied, 64 new) and 192 trials of words (128 studied, 64

new).

Figure 4.1: Schematic of experimental paradigm. (A1 to A5) During the study period, participants were asked to judge whether the second stimulus matched the first. (B1 to B3) During the test period, participants made memory judgments to stimuli while rating their confidence. Low represents remember with low confidence, High represents remember with high confidence, and New represents a judgment where participants thought the stimulus was not studied.

EEG data EEG data were collected during both study and test sessions using

the Brain Vision ActiChamp 32 channel system and recorded using the Pycorder

software. Electrode positions followed the 10-20 system and included Fz, Cz, Pz, Oz,

Fp1, Fp2, F3, F4, F7, F8, Fc1, Fc2, Fc5, Fc6, Ft9, Ft10, T7, T8, C3, C4, Cp1, Cp2,

Cp5, Cp6, Tp9, Tp10, P3, P4, P7, P8, O1, and O2 (Figure 4.2). During recording,

data were sampled at 1000 Hz and filtered between 0.01 and 100 Hz. Offline, data

were high-pass filtered with a 0.1 Hz Butterworth filter, downsampled to 256 Hz,

and referenced to the average of the mastoids (TP9 and TP10). Post-stimulus ERPs

with a 1000 ms duration were extracted and were baseline-corrected with respect to a

200 ms prestimulus baseline. Visual inspection was then used to remove epochs that

contained eye blinks and movement artifacts.

Figure 4.2: Map of the channel locations

4.2.2 Artifact Removal

Brain signals often contain significant artifacts that lead to major problems in signal analysis when the activity due to artifacts has a higher amplitude than that due to neural sources. Common sources of artifacts include eye movements, muscle contractions, and interference from electric devices [41]. Independent Component Analysis (ICA) has been successfully applied to artifact removal in many studies. The basic idea is to decompose the brain data into independent components, identify the artifact components using pattern and source localization analysis, and reconstruct the brain signals by excluding those components. However, linking components to artifact sources (e.g., eye blinking, muscle movements) remains largely user-dependent. In this study, we employed a recently developed automatic ICA-based algorithm, called ADJUST [42], for signal artifact removal. ADJUST applies stereotyped artifact-specific spatial and temporal features to identify independent components of artifacts automatically. These artifacts can be removed from the data without affecting the activity of neural sources [42]. The data analysis in the following is based on the 'cleaned' data after artifact removal.
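For illustration, the following is a minimal sketch of ICA-based artifact removal using the MNE-Python library (an assumed tooling choice; ADJUST itself is distributed as an EEGLAB plugin). The file name is hypothetical, and the excluded component indices would in practice come from an automatic criterion such as ADJUST's spatio-temporal features rather than being hard-coded.

    import mne

    raw = mne.io.read_raw_fif("subject01_raw.fif", preload=True)  # hypothetical file
    raw.filter(l_freq=1.0, h_freq=None)       # high-pass filtering helps the ICA fit

    ica = mne.preprocessing.ICA(n_components=20, random_state=0)
    ica.fit(raw)
    ica.exclude = [0, 3]                      # e.g., blink and eye-movement components
    raw_clean = ica.apply(raw.copy())         # reconstruct signals without the artifacts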

4.2.3 Signal Feature Extraction

We extensively investigated features from the collected physiological signals. Four groups of feature extraction techniques were employed to capture signal characteristics that may be relevant to assessing memory workload: signal power, statistical, morphological, and wavelet features. For a data epoch with n channels, we first extracted features from the signal at each channel and then concatenated the features of all n channels to construct the feature vector of the data epoch. Let X = \{x_1, x_2, \dots, x_m\} denote a single-channel signal with m points; the four groups of signal features are described as follows.


Figure 4.3: Artifact Removal Using ICA

Signal Power Features: Adopting the signal features used in previous work [43], we computed the signal power for each channel in every non-overlapping 2-Hz interval from 4 to 40 Hz. The 18 power features provide finer signal power spectrum information than the commonly used brain signal frequency bands, such as the theta, alpha, beta, and gamma bands.
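A minimal sketch of these band-power features using Welch's method from SciPy (an assumed implementation choice, not the study's code); x is one channel's epoch sampled at fs Hz.

    import numpy as np
    from scipy.signal import welch

    def band_powers(x, fs, lo=4, hi=40, width=2):
        """Power in each non-overlapping 2-Hz interval from 4 to 40 Hz."""
        freqs, psd = welch(x, fs=fs, nperseg=min(len(x), int(fs)))
        df = freqs[1] - freqs[0]
        feats = []
        for f0 in range(lo, hi, width):               # 4-6, 6-8, ..., 38-40 Hz
            mask = (freqs >= f0) & (freqs < f0 + width)
            feats.append(np.sum(psd[mask]) * df)      # integrate PSD over the band
        return np.array(feats)                        # 18 power features per channel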

Band Power Asymmetry: While it is well known that emotional states are associated with the power asymmetry of EEG signals, it is unknown whether power asymmetry can serve as an indicator distinguishing musicians from non-musicians. We studied power asymmetry in two ways, inter-hemispheric and intra-hemispheric asymmetry, as shown in the channel locations map in Figure 4.5 [44].

Statistical Features: We used the four most widely used statistical measures, mean, variance, skewness, and kurtosis, to characterize the distribution of signal amplitudes. In particular, the mean is the averaged signal amplitude, and the variance measures the signal variability around the mean. The higher-order statistic skewness quantifies the extent to which the distribution leans to one side of the mean, and kurtosis measures the 'peakedness' of the distribution.
Figure 4.4: Topographies for ICA-based artifact removal: (a) eye blink, (b) vertical eye movement, (c) horizontal eye movement, (d) generic discontinuity, (e) neural activity.

Figure 4.5: Groups of channels for inter- and intra-hemispheric power band asymmetry. For inter-hemispheric power band asymmetry, the value is calculated for pairs of the same color across hemispheres. For intra-hemispheric power band asymmetry, the value is calculated for pairs of different colors within the same hemisphere.

Morphological Features: Three morphological features were extracted to describe the morphological characteristics of a single-channel signal. These features proved useful in our previous studies of brain signals [45, 46]. A brief description of the morphological features is given in the following; a computational sketch follows the list.

• Curve Length: also known as ‘line length’ which was first proposed by Olsen et

al. [47]. Curve length is the sum of distances between successive points, given

by

\sum_{i=1}^{m-1} |x_{i+1} - x_i|.          (4.1)
Since curve length increases as the signal magnitude or frequency increases, it

is a measure of amplitude-frequency variations of a signal. It has been used in

many brain signal studies, such as epileptic seizure detection [48], stimulation

responses of the brain [49].

• Number of Peaks: a widely used characteristic to measure the overall frequency

of a signal. The number of peaks in a signal X can be calculated by


\frac{1}{2} \sum_{i=1}^{m-2} \max\{0, \mathrm{sgn}(x_{i+2} - x_{i+1}) - \mathrm{sgn}(x_{i+1} - x_i)\}.          (4.2)

• Average Nonlinear Energy: nonlinear energy was first proposed by Kaiser [50].

It has been found that the nonlinear energy is sensitive to spectral changes.

Thus, it is useful to capture spectral information of a signal [51]. The average

nonlinear energy of the single-channel signal X is computed as


\frac{1}{m-2} \sum_{i=2}^{m-1} \left( x_i^2 - x_{i-1} x_{i+1} \right).          (4.3)

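The sketch below implements the three morphological features (4.1)-(4.3) for a single-channel signal x stored as a NumPy array (illustrative code, not the study's implementation).

    import numpy as np

    def curve_length(x):
        return np.sum(np.abs(np.diff(x)))                      # eq. (4.1)

    def number_of_peaks(x):
        s = np.sign(np.diff(x))                                # sgn(x_{i+1} - x_i)
        return 0.5 * np.sum(np.maximum(0, s[1:] - s[:-1]))     # eq. (4.2)

    def avg_nonlinear_energy(x):
        return np.mean(x[1:-1] ** 2 - x[:-2] * x[2:])          # eq. (4.3)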
Time-Frequency Features: Wavelet transform (WT) is a powerful tool to

perform time-frequency analysis of signals. The fundamental idea of WT is to rep-

resent a signal by a linear combination of a set of functions obtained by shifting or


dilating a particular function called mother wavelet [52]. The WT of a signal X(t) is

defined as
t−b
Z
1
C(a, b) = X(t) √ Ψ( )dt (4.4)
R a a
where Ψ is the mother wavelet, C(a, b) are the WT coefficients of the signal X(t), a

is the scale parameter, and b is the shifting parameter. Continuous wavelet transform

(CWT) has a ∈ R+ and b ∈ R, and discrete wavelet transform (DWT) has a = 2j

and b = k2j for all (j, k) ∈ Z given the decomposition level of j. Since CWT

explores every possible scale a and shifting b, it is generally a lot more computationally

expensive than DWT. As a result, DWT is often used to perform time-frequency

analysis of a signal at different decomposition levels [53]. The DWT coefficients

provide a non-redundant and highly efficient representation of a signal in both time

and frequency domain. At each level of decomposition, DWT works as a set of band-

pass filters to divide a signal into two bands called approximations and details signals.

The approximations (A) are the low frequency components of the signal, and the

details (D) are the high-frequency components. Among different wavelet families, we

employed Daubechies wavelet as it is frequently used in physiological signal analysis

due to its orthogonality property and efficient filter implementation [54]. A 4-level

DWT decomposition was applied to the collected signals with the sampling rate of

128 Hz. Table 4.1 lists the decomposed signals A4, D4, D3, D2, D1, which roughly

corresponded to the commonly recognized brain signal frequency bands delta, theta,

alpha, beta, and gamma, respectively.

After the 4-level DWT decomposition, a set of wavelet coefficients were obtained

for each decomposed signals. To further decrease feature dimensionality, we employed

a popular measure called wavelet entropy (WE), which indicates the degree of multi-

frequency signal order/disorder in the signals [55]. To obtain wavelet entropy, the first

step is to calculate relative wavelet energy for each decomposition level as follows

p_j = \frac{E_j}{E_{tot}} = \frac{E_j}{\sum_{j=1}^{n} E_j},          (4.5)

where j is the resolution level, n is the number of resolution levels selected for analysis (n = 5 in this study), and E_j is the wavelet power at decomposition level j, calculated as the sum of the squared wavelet coefficients at that level. The relative wavelet energy p_j can be considered the power density of the decomposed signal at level j, and it satisfies \sum_{j=1}^{n} p_j = 1. Following the Shannon entropy [56] for analyzing and comparing probability distributions, the WE is defined similarly by
WE = -\sum_{j=1}^{n} p_j \ln(p_j),          (4.6)

where p_j is the relative wavelet energy at resolution level j. The wavelet entropy offers a suitable tool for characterizing the order/disorder of the signal power across the five brain signal frequency bands (delta, theta, alpha, beta and gamma). For example, if the relative wavelet energy at one resolution level (e.g., the alpha band) dominates the others, so that its p_i is almost one and all other relative wavelet energies are almost zero, the wavelet entropy will be very small, near zero. On the other hand, if the relative wavelet energies are almost equal across all resolution levels, the WE reaches its maximum value.
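A minimal sketch of the 4-level Daubechies DWT and the wavelet entropy (4.5)-(4.6), assuming the PyWavelets package and the db4 wavelet (the Daubechies order is an assumption, as the text does not specify it); x stands in for one channel's epoch.

    import numpy as np
    import pywt

    x = np.random.default_rng(0).normal(size=1280)    # placeholder single-channel epoch

    coeffs = pywt.wavedec(x, "db4", level=4)          # [A4, D4, D3, D2, D1]
    energies = np.array([np.sum(c ** 2) for c in coeffs])
    p = energies / energies.sum()                     # relative wavelet energy, eq. (4.5)
    wavelet_entropy = -np.sum(p * np.log(p + 1e-12))  # eq. (4.6), guarded against p = 0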

4.2.4 Feature Vector Classification Using Proximal Support Vector Machine (PSVM)

Classification Method In the experiments, we collected data from four difficulty

levels (0-, 1-, 2-, 3-back). A popular binary classification technique, support vector

machine (SVM), was employed to investigate the data separability at different mental

workload levels. SVM techniques have been successfully applied in many classification
Table 4.1: Frequency ranges and the corresponding brain signal frequency bands of
the five levels of signals by discrete wavelet decomposition.
Decomposed Level Frequency Range (Hz) Approximate Band
D1 32-64 Gamma
D2 16-32 Beta
D3 8-16 Alpha
D4 4-8 Theta
A4 0-4 Delta

problems [57, 58, 59, 60, 61]. The fundamental problem of SVM is to build an optimal

decision boundary to separate two categories of data. Let Y denote an n×k dimensional

feature vector for a multi-channel data session at certain difficulty level, where n is

the number of signal channels and k is the number of features of each channel. To

classify data with two workload levels, let l denote the sample class label and l = 1

denotes one workload level, and l = −1 means the other workload level.

Assume we have p sessions of level one denoted by S1 = {(Y1 , l1 ), (Y2 , l2 ), ..., (Yp , lp )},

and q sessions of level two denoted by S2 = {(Yp+1 , lp+1 ), (Yp+2 , lp+2 ), ..., (Yp+q , lp+q )}.

Each session is represented by a n × k dimensional feature vector. One can find

infinitely many hyperplanes in Rn×k to separate the two data groups. Based on

statistical learning theory (SLT), an SVM selects a hyperplane which maximizes its distance from the closest point among the samples. This distance is referred to as the mar-

gin. The standard SVM formulation that maximizes the margin and minimizes the

training error is as follows:

\min_{\omega, \xi, b} \left\{ \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{p+q} \xi_i \;:\; D(Y^T \omega + b e) \ge e - \xi \right\},          (4.7)

where ω is the weight vector, and the slack variables ξ are introduced to measure the

degree of misclassification during training. The penalty cost C is used to control the

tradeoff between a large margin and a small prediction error penalty. Each column

of Y is an observation Yi , D is a diagonal matrix with class-label elements Dii equal


to 1 if Yi belongs to one class, or -1 otherwise. The vector e has all its elements equal

to one. The first term of the objective function in (4.7) maximizes the margin of separation 2/\|\omega\|, and the second term measures how much emphasis is given to

the minimization of the training error.

Since standard SVM classifiers usually require a large amount of computation time for training, the Proximal SVM (PSVM) algorithm was introduced by Mangasarian and Wild [62] as a fast alternative to the standard SVM formulation. The formulation for the linear PSVM is as follows:

\min_{\omega, \xi, b} \left\{ \frac{1}{2}(\|\omega\|^2 + b^2) + \frac{1}{2} C\, \xi^T \xi \;:\; D(Y^T \omega + b e) = e - \xi \right\},          (4.8)

where the traditional SVM inequality constraint is replaced by an equality con-

straint. This modification changes the nature of the support hyperplanes (ω T Y + b =

±1). Instead of bounding planes, the hyperplanes of PSVM can be thought of as

‘proximal’ planes, around which the points of each class are clustered and which are

pushed as far apart as possible by the term (kωk2 +b2 ) in the above objective function.

It has been shown that PSVM has comparable classification performance to that of

standard SVM classifiers, but can be an order of magnitude faster [62]. Therefore,

we employed PSVM in this study.
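Because the constraint in (4.8) is an equality, ξ can be eliminated: substituting ξ = e - D(Aω + eb) into the objective and setting the gradient to zero yields the single linear system (I/C + EᵀE)z = Eᵀd with E = [A, e] and z = [ω; b]. The following is a minimal sketch of that closed form (names, shapes and the sign convention are assumptions, not the authors' code).

    import numpy as np

    def psvm_train(A, labels, C=1.0):
        """Linear PSVM: A is (n_samples, n_features); labels are in {-1, +1}."""
        n, m = A.shape
        E = np.hstack([A, np.ones((n, 1))])       # E = [A, e], z = [w; b]
        d = labels.astype(float)                  # D e = d for D = diag(labels)
        H = np.eye(m + 1) / C + E.T @ E
        z = np.linalg.solve(H, E.T @ d)           # (I/C + E^T E) z = E^T D e
        return z[:-1], z[-1]                      # weight vector w and offset b

    def psvm_predict(A, w, b):
        return np.sign(A @ w + b)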

Training and Evaluation A classification problem generally follows a two-step

procedure which consists of training and testing phases. During the training phase, a

classifier is trained to achieve the optimal separation for the training data set. Then

in the testing phase, the trained classifier is used to classify new samples with un-

known class information. The N-fold cross-validation is an attractive method of model

evaluation when the sample size is small. It is capable of providing an almost unbiased estimate of the generalization ability of a classifier [63]. For the 29 subjects, the total

numbers of data samples (trials) for sessions A and B are 128 and 386, respectively. We designed a 5-fold cross-validation method to train and evaluate the SVM classifier.

To explore the differences in the responses of musicians and non-musicians under various events, we separated the data into 5 and 3 epochs for sessions A and B, respectively, as shown in Figure 4.1. Based on the event markers of the EEG data, we further defined 23 conditions for each session. The following table lists all of the conditions.

Table 4.2: A list of all comparison conditions of the experiments. For comparison conditions 4 to 11, the naming structure is stimulus/ground truth/response. For conditions 12 to 23, the naming structure is stimulus/response to that stimulus in the test session/whether it was the 1st or 2nd stimulus. For conditions 35 to 46, it is stimulus/confidence level of having seen the stimulus/correctness.
Group A Group B
condition event condition event
1 all samples 24 all samples
2 picture 25 picture
3 word 26 word
4 picture - same - same 27 picture - long term Low
5 picture - same - diff 28 picture - long term High
6 picture - diff - diff 29 picture - long term New
7 picture - diff - same 30 word - long term Low
8 word - same - same 31 word - long term High
9 word - same - diff 32 word - long term New
10 word - diff - diff 33 picture - correct
11 word - diff - same 34 word - correct
12 picture - long term Low - stim1 35 picture - low confidence - correct
13 picture - long term High - stim1 36 picture - high confidence -correct
14 picture - long term New - stim1 37 picture - new - correct
15 word - long term Low - stim1 38 picture - low confidence - wrong
16 word - long term High - stim1 39 picture - high confidence -wrong
17 word - long term New - stim1 40 picture - new - wrong
18 picture - long term Low - stim2 41 word - low confidence - correct
19 picture - long term High - stim2 42 word - high confidence -correct
20 picture - long term New - stim2 43 word - new - correct
21 word - long term Low - stim2 44 word - low confidence - wrong
22 word - long term High - stim2 45 word - high confidence -wrong
23 word - long term New - stim2 46 word - new - wrong

For each comparison group, we divided the corresponding data samples into 5 non-overlapping subsets. Each time, one subset was held out, and the PSVM classifier was trained on the data samples of the remaining subsets; the samples of the held-out subset were treated as unknown samples to test the performance of the trained classifier. Repeating this procedure for each subset, the averaged prediction accuracy over the 5 folds was used to indicate the degree of separability of the EEG signals of musicians and non-musicians.

To achieve reliable feature selection, we employed an advanced feature selection

technique, called minimum redundancy maximum relevance (mRMR) [64], which

allows us to select a subset of superior features at a low computational cost in a high

dimensional space.

The basic idea of mRMR is to select the most relevant features with respect to

class labels while minimizing redundancy amongst the selected features. The mRMR

algorithm uses mutual information as a distance measure to compute feature-to-

feature and feature-to-class-label non-linear similarities.

For two features X and Y, p(X) and p(Y) are the marginal probability functions, p(X, Y) is the joint probability distribution, and I(X; Y) is the mutual information of X and Y:


 
I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)},          (4.9)

The mRMR method aims to minimize redundancy (Rd) while maximizing relevance (Re) among the features. Rd and Re are defined as follows:

Rd = \frac{1}{|S|^2} \sum_{i, j \in S} I(i, j)          (4.10)

Re = \frac{1}{|S|} \sum_{i \in S} I(h, i)          (4.11)
where S is the set of features, h is the target class label, and I(i, j) is the mutual information between features i and j. The feature selection criterion combining the above two constraints is the mRMR, for which the objective function of feature selection can be defined by

\phi(Re, Rd) = Re - Rd.          (4.12)

An optimal subset of features is one that maximizes the above mRMR objective function.
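A minimal sketch of greedy mRMR selection, using scikit-learn's mutual-information estimators as an assumed stand-in (an implementation choice; the original mRMR code discretizes the features instead):

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

    def mrmr_select(X, y, n_select=10):
        """Greedily pick features maximizing relevance minus mean redundancy."""
        relevance = mutual_info_classif(X, y)            # I(feature; class label)
        selected = [int(np.argmax(relevance))]
        while len(selected) < n_select:
            best_j, best_score = None, -np.inf
            for j in range(X.shape[1]):
                if j in selected:
                    continue
                redundancy = np.mean([mutual_info_regression(X[:, [k]], X[:, j])[0]
                                      for k in selected])
                score = relevance[j] - redundancy        # phi = Re - Rd, eq. (4.12)
                if score > best_score:
                    best_j, best_score = j, score
            selected.append(best_j)
        return selected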

4.3 Result Discussion

Table 4.3 shows the classification accuracy for the 46 conditions and 8 epochs with 5-fold cross-validation and 10 features selected by mRMR, without any ICA artifact removal. The classification accuracies mostly range from 60% to 85%. Some conditions reach 90% or higher, such as condition 21 at epoch A3 and condition 27 at epoch B2. The highest accuracy, 94.59%, occurs at condition 30 and epoch B1.

Table 4.4 shows the classification accuracy for the 46 conditions and 8 epochs with 5-fold cross-validation and 10 features selected by mRMR, with ICA artifact removal. The classification accuracies mostly range from 70% to 85%. Some conditions reach 90% or higher, such as condition 2 at epoch A4, condition 14 at epoch A3, condition 15 at epoch A4, and more. The highest accuracy, 97.30%, occurs at condition 20 and epoch A4. The results are generally better than those obtained directly from the raw data without ICA artifact removal.

To sum up, of the two classification settings we tried, we found that 5-fold cross-validation with 10 mRMR-selected features and ICA artifact removal gives better results, and that epoch A4 generally gives better classification results. Looking at

Table 4.3: The table of the classification accuracy for 46 conditions and 8 epochs
with 5-fold cross validation and 10 features selected by mRMR and without any ICA
artifacts removal.
Epoch Epoch
condition A1 A2 A3 A4 A5 condition B1 B2 B3
1 51.35 78.38 83.78 81.08 70.27 24 59.46 78.38 64.86
2 51.35 70.27 64.86 72.97 67.57 25 86.49 81.08 54.05
3 81.08 70.27 67.57 86.49 83.78 26 67.57 72.97 72.97
4 - - - - - 27 72.97 91.89 70.27
5 - - - - - 28 64.86 81.08 81.08
6 64.86 78.38 62.16 75.68 72.97 29 70.27 89.19 59.46
7 80.00 68.57 88.57 74.29 77.14 30 94.59 86.49 89.19
8 - - - - - 31 67.57 83.78 72.97
9 - - - - - 32 70.27 81.08 72.97
10 75.00 80.56 77.78 86.11 77.78 33 72.97 81.08 56.76
11 - - - - - 34 59.46 78.38 70.27
12 67.57 64.86 78.38 78.38 70.27 35 78.38 67.57 62.16
13 59.46 67.57 75.68 78.38 75.68 36 72.97 83.78 75.68
14 62.16 78.38 70.27 75.68 89.19 37 75.68 86.49 51.35
15 56.76 78.38 75.68 81.08 78.38 38 62.16 70.27 56.76
16 81.08 81.08 83.78 86.49 78.38 39 74.29 57.14 65.71
17 72.97 64.86 75.68 72.97 67.57 40 78.38 81.08 81.08
18 59.46 81.08 62.16 81.08 72.97 41 72.97 72.97 72.97
19 62.16 64.86 72.97 72.97 78.38 42 72.97 75.68 72.97
20 83.78 59.46 51.35 67.57 75.68 43 81.08 81.08 62.16
21 86.49 81.08 91.89 70.27 81.08 44 75.00 63.89 63.89
22 64.86 86.49 64.86 89.19 51.35 45 62.16 70.27 70.27
23 56.76 70.27 78.38 75.68 81.08 46 83.78 72.97 72.97

the selected features of the highest-accuracy setting, we may find the major differences in the EEG signals between musicians and non-musicians under a given condition. For epoch A4 and condition 20, the classifier extensively selected features F1, F8, F14 and F18, which are drawn from the mean, variance, skewness, kurtosis, relative band power, wavelet entropy and wavelet statistics.
entropy and wavelet statistics.

Figure 4.6 compares the EEG signals of the 30 channels for musicians and non-musicians at epoch B1 under condition 30, where the PSVM classifier reaches 97.30% classification accuracy. The plots reveal significant differences between the two groups.

Table 4.4: The table of the classification accuracy for 46 conditions and 8 epochs with
5-fold cross validation and 10 features selected by mRMR and with ICA artifacts
removal
Epoch Epoch
condition A1 A2 A3 A4 A5 condition B1 B2 B3
1 86.49 64.86 72.97 81.08 62.16 24 56.76 67.57 62.16
2 78.38 78.38 75.68 91.89 67.57 25 48.65 78.38 67.57
3 72.97 75.68 78.38 70.27 62.16 26 86.49 64.86 62.16
4 - - - - - 27 59.46 83.78 78.38
5 - - - - - 28 78.38 78.38 70.27
6 83.78 81.08 86.49 81.08 72.97 29 81.08 72.97 56.76
7 71.43 68.57 65.71 82.86 77.14 30 97.30 75.68 64.86
8 - - - - - 31 59.46 81.08 64.86
9 - - - - - 32 81.08 72.97 81.08
10 75.00 69.44 72.22 66.67 50.00 33 72.97 78.38 56.76
11 - - - - - 34 54.05 89.19 81.08
12 75.68 78.38 62.16 72.97 75.68 35 78.38 86.49 72.97
13 64.86 64.86 78.38 89.19 70.27 36 83.78 67.57 64.86
14 81.08 72.97 91.89 62.16 83.78 37 75.68 78.38 70.27
15 89.19 67.57 59.46 91.89 81.08 38 86.49 91.89 54.05
16 78.38 78.38 83.78 70.27 70.27 39 71.43 71.43 68.57
17 81.08 75.68 67.57 72.97 75.68 40 78.38 86.49 62.16
18 75.68 72.97 75.68 78.38 81.08 41 75.68 83.78 70.27
19 75.68 78.38 83.78 83.78 56.76 42 78.38 59.46 54.05
20 72.97 72.97 67.57 97.30 67.57 43 86.49 78.38 75.68
21 78.38 67.57 78.38 75.68 70.27 44 72.22 91.67 72.22
22 62.16 67.57 78.38 81.08 72.97 45 78.38 83.78 67.57
23 81.08 70.27 81.08 59.46 81.08 46 62.16 67.57 67.57

significant differences between the two groups. Figure 4.7 shows that musicians tend to be more activated in the memory test.

4.4 Summary and Future Work

In conclusion, the method satisfactorily predicts the class of the subjects. The highest success rate, 97.30%, occurs at condition 30 and epoch B1.

Because different events may elicit different responses, the sessions are separated, based on the event markers, into several small parts for detailed analysis.

[Figure: 30-channel EEG traces (channels fp1 through o2, arranged by scalp position); amplitude scale ±3.78; time axis 0–246 ms.]

Figure 4.6: Comparison of the EEG signals of 30 channels of musicians (blue line) and non-musicians (red line) at epoch B1 and condition 30.

[Head-plot color scale ±2.3; latency 100 ms.]
(a) The topography of non-musicians at epoch B1 and condition 30 at 100 ms
(b) The topography of musicians at epoch B1 and condition 30 at 100 ms

Figure 4.7: Head plots for musicians and non-musicians at epoch B1 at 100 ms, with ICA-based artifact removal.

Table 4.5: Classification sensitivity and specificity for 46 conditions and 8 epochs with 5-fold cross-validation, 10 features selected by mRMR, and without ICA artifact removal.
Epoch Epoch
A1 A2 A3 A4 A5 B1 B2 B3
cond. sen spec sen spec sen spec sen spec sen spec cond. sen spec sen spec sen spec
1 0.63 0.39 0.63 0.94 0.89 0.78 0.95 0.67 0.84 0.56 24 0.74 0.44 0.84 0.72 0.68 0.61
2 0.58 0.44 0.58 0.83 0.68 0.61 0.79 0.67 0.74 0.61 25 0.79 0.94 0.84 0.78 0.58 0.50
3 0.89 0.72 0.79 0.61 0.63 0.72 0.84 0.89 0.89 0.78 26 0.68 0.67 0.74 0.72 0.79 0.67
4 - - - - - - - - - - 27 0.84 0.61 0.95 0.89 0.84 0.56
5 - - - - - - - - - - 28 0.84 0.44 0.79 0.83 0.79 0.83
6 0.68 0.61 0.74 0.83 0.53 0.72 0.68 0.83 0.63 0.83 29 0.68 0.72 0.95 0.83 0.58 0.61
7 0.89 0.69 0.84 0.50 1.00 0.75 0.74 0.75 0.84 0.69 30 0.89 1.00 0.89 0.83 0.89 0.89
8 - - - - - - - - - - 31 0.63 0.72 0.89 0.78 0.74 0.72
9 - - - - - - - - - - 32 0.79 0.61 0.89 0.72 0.74 0.72
10 0.79 0.71 0.84 0.76 0.84 0.71 0.84 0.88 0.84 0.71 33 0.79 0.67 0.95 0.67 0.32 0.83
11 - - - - - - - - - - 34 0.58 0.61 0.89 0.67 0.74 0.67
12 0.68 0.67 0.63 0.67 0.79 0.78 0.79 0.78 0.79 0.61 35 0.89 0.67 0.79 0.56 0.58 0.67
13 0.47 0.72 0.68 0.67 0.84 0.67 0.79 0.78 0.84 0.67 36 0.68 0.78 0.95 0.72 0.63 0.89
14 0.58 0.67 0.95 0.61 0.74 0.67 0.74 0.78 0.84 0.94 37 0.79 0.72 0.89 0.83 0.53 0.50
15 0.58 0.56 0.74 0.83 0.74 0.78 0.74 0.89 0.74 0.83 38 0.74 0.50 0.58 0.83 0.63 0.50
16 0.84 0.78 0.84 0.78 0.84 0.83 0.79 0.94 0.84 0.72 39 0.58 0.94 0.53 0.63 0.74 0.56
17 0.74 0.72 0.74 0.56 0.79 0.72 0.68 0.78 0.74 0.61 40 0.89 0.67 0.79 0.83 0.89 0.72
18 0.63 0.56 0.79 0.83 0.63 0.61 0.68 0.94 0.68 0.78 41 0.68 0.78 0.74 0.72 0.79 0.67
19 0.74 0.50 0.74 0.56 0.79 0.67 0.74 0.72 0.79 0.78 42 0.84 0.61 0.84 0.67 0.74 0.72
20 0.95 0.72 0.53 0.67 0.58 0.44 0.79 0.56 0.79 0.72 43 0.84 0.78 0.89 0.72 0.63 0.61
21 0.84 0.89 0.95 0.67 1.00 0.83 0.68 0.72 0.84 0.78 44 0.78 0.72 0.61 0.67 0.61 0.67
22 0.63 0.67 0.89 0.83 0.84 0.44 0.89 0.89 0.53 0.50 45 0.79 0.44 0.79 0.61 0.79 0.61
23 0.53 0.61 0.68 0.72 0.89 0.67 0.74 0.78 0.84 0.78 46 0.95 0.72 0.79 0.67 0.79 0.67
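For reference, the sensitivity and specificity reported in Tables 4.5 and 4.6 follow the standard confusion-matrix definitions. A minimal computation is sketched below; coding musicians as the positive class (label 1) is an assumption made purely for illustration.

    import numpy as np

    def sensitivity_specificity(y_true, y_pred):
        """sen = TP / (TP + FN); spec = TN / (TN + FP). Labels in {0, 1}."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        return tp / (tp + fn), tn / (tn + fp)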

There are only two classes, musicians and non-musicians, in this prediction process. Univariate features are extracted from the 30 channels of EEG signals. We have considered signal power features, band power asymmetry, morphological features, statistical features and time-frequency features. Artifact removal based on ICA gives better results than directly using the raw data.
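As a small illustration of the kind of univariate features involved, the sketch below computes a few statistical and band power features for a single channel; the sampling rate and the alpha band edges (8–13 Hz) are illustrative assumptions, and the full feature set used in the study is much larger.

    import numpy as np
    from scipy import signal, stats

    def channel_features(x, fs=256):
        """A few univariate features for one EEG channel."""
        freqs, psd = signal.welch(x, fs=fs)            # power spectral density
        band = (freqs >= 8) & (freqs <= 13)            # alpha band, for example
        return {
            "mean": np.mean(x),
            "variance": np.var(x),
            "skewness": stats.skew(x),
            "kurtosis": stats.kurtosis(x),
            "relative_alpha_power": np.trapz(psd[band], freqs[band])
                                    / np.trapz(psd, freqs),
        }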

In the future, we will consider outlier removal techniques for epochs. Bad data exist in every EEG recording, with many possible causes, such as muscle movement or distraction of the participant during the experiment. The performance is expected to be enhanced by removing the contaminated epochs; a simple example of such a rule is sketched below.
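One simple rule of this kind is an amplitude threshold; in the sketch below, the threshold value and the epoch array layout are illustrative assumptions, and more refined criteria (e.g., statistical outlier tests) could replace the peak-amplitude test.

    import numpy as np

    def reject_epochs(epochs, max_amp=100.0):
        """epochs: (n_epochs, n_channels, n_samples); drop high-amplitude epochs."""
        peak = np.abs(epochs).max(axis=(1, 2))   # peak absolute amplitude per epoch
        keep = peak < max_amp
        return epochs[keep], keep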

Table 4.6: Classification sensitivity and specificity for 46 conditions and 8 epochs with 5-fold cross-validation, 10 features selected by mRMR, and with ICA artifact removal.
Epoch Epoch
A1 A2 A3 A4 A5 B1 B2 B3
cond. sen spec sen spec sen spec sen spec sen spec cond. sen spec sen spec sen spec
1 0.84 0.89 0.63 0.67 0.84 0.61 0.84 0.78 0.53 0.72 24 0.53 0.61 0.68 0.67 0.74 0.50
2 0.74 0.83 0.79 0.78 0.89 0.61 0.95 0.89 0.84 0.50 25 0.58 0.39 0.84 0.72 0.53 0.83
3 0.79 0.67 0.79 0.72 0.79 0.78 0.74 0.67 0.58 0.67 26 0.84 0.89 0.84 0.44 0.74 0.50
4 - - - - - - - - - - 27 0.68 0.50 0.89 0.78 0.84 0.72
5 - - - - - - - - - - 28 0.74 0.83 0.84 0.72 0.79 0.61
6 0.74 0.94 0.84 0.78 0.89 0.83 0.89 0.72 0.68 0.78 29 0.79 0.83 0.84 0.61 0.58 0.56
7 0.68 0.75 0.79 0.56 0.74 0.56 0.84 0.81 0.84 0.69 30 0.95 1.00 0.89 0.61 0.63 0.67
8 - - - - - - - - - - 31 0.84 0.33 1.00 0.61 0.84 0.44
9 - - - - - - - - - - 32 0.84 0.78 0.79 0.67 0.79 0.83
10 0.89 0.59 0.74 0.65 0.84 0.59 0.74 0.59 0.42 0.59 33 0.84 0.61 0.89 0.67 0.79 0.33
11 - - - - - - - - - - 34 0.53 0.80 0.79 1.00 0.79 0.83
12 0.79 0.72 0.95 0.61 0.58 0.67 0.89 0.56 0.79 0.72 35 0.84 0.72 0.84 0.89 0.74 0.72
13 0.63 0.67 0.63 0.67 0.74 0.83 0.95 0.83 0.74 0.67 36 0.89 0.78 0.74 0.61 0.58 0.72
14 0.95 0.67 0.68 0.78 0.95 0.89 0.63 0.61 0.74 0.94 37 0.68 0.83 0.79 0.78 0.74 0.67
15 0.95 0.83 0.68 0.67 0.42 0.78 0.95 0.89 0.89 0.72 38 1.00 0.72 0.95 0.89 0.58 0.50
16 0.84 0.72 0.89 0.67 0.89 0.78 0.84 0.56 0.74 0.67 39 0.63 0.81 0.74 0.69 0.68 0.69
17 0.79 0.83 0.84 0.67 0.89 0.44 0.79 0.67 0.68 0.83 40 0.95 0.61 0.95 0.78 0.68 0.56
18 0.68 0.83 0.74 0.72 0.79 0.72 0.74 0.83 0.79 0.83 41 0.79 0.72 0.84 0.83 0.63 0.78
19 0.63 0.89 0.74 0.83 0.84 0.83 0.89 0.78 0.68 0.44 42 0.84 0.72 0.68 0.50 0.58 0.50
20 0.74 0.72 0.74 0.72 0.68 0.67 0.95 1.00 0.63 0.72 43 0.84 0.89 0.84 0.72 0.63 0.89
21 0.74 0.83 0.74 0.61 0.84 0.72 0.68 0.83 0.58 0.83 44 0.72 0.72 1.00 0.83 0.67 0.78
22 0.84 0.39 0.74 0.61 0.79 0.78 0.84 0.78 0.74 0.72 45 0.74 0.83 0.84 0.83 0.89 0.44
23 0.84 0.78 0.68 0.72 0.84 0.78 0.74 0.44 0.74 0.89 46 0.63 0.61 0.68 0.67 0.84 0.50

CHAPTER 5

Conclusions and Future Research

This dissertation focuses on methodologies for addressing the two problems introduced earlier, which concern prediction in the healthcare and service industries. The problems involve both stationary and non-stationary time series.

Chapter 2 presents the application of ARIMA and the dynamic linear model to stationary time series prediction problems in the healthcare and railroad industries. ARIMA and DLM represent two different ways to explain and model time series.

The dynamic linear model (DLM), a special type of state space method, has been developed as an alternative tool for time series forecasting. However, to apply a DLM, the signal-to-noise ratio R has to be specified. Since the true value of R is generally not available, the only recourse is to guess a value, which is inconvenient and unreliable. To overcome this problem, we propose a method to estimate R automatically within the forecasting procedure. The properties of the proposed R estimator, and of the new forecasting procedure that uses it, are studied by simulation.
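To make the role of R concrete, the sketch below estimates it for the simplest DLM, the local-level model y_t = mu_t + v_t, mu_t = mu_{t-1} + w_t with R = Var(w)/Var(v), by maximizing the Kalman-filter likelihood with the observation variance normalized to one. The local-level form and the maximum-likelihood search are illustrative assumptions; the estimator actually proposed in Chapter 2 may differ.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def neg_loglik(log_R, y):
        """Kalman-filter negative log-likelihood of a local-level DLM (V = 1, W = R)."""
        R = np.exp(log_R)
        m, C, ll = y[0], 1.0, 0.0
        for t in range(1, len(y)):
            a, P = m, C + R              # state prediction
            f, Q = a, P + 1.0            # one-step forecast mean and variance
            e = y[t] - f                 # forecast error
            ll -= 0.5 * (np.log(2 * np.pi * Q) + e ** 2 / Q)
            K = P / Q                    # Kalman gain
            m, C = a + K * e, P - K * P  # filtered state
        return -ll

    def estimate_R(y):
        res = minimize_scalar(lambda lr: neg_loglik(lr, np.asarray(y, float)),
                              bounds=(-10.0, 5.0), method="bounded")
        return float(np.exp(res.x))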

In Chapter 3, we described our proposed pattern-matching-based semi-periodic time series prediction framework and applied it to respiratory motion time series prediction. In radiotherapy, system latencies need to be compensated to achieve accurate irradiation during treatment. Accurate respiratory motion prediction can minimize the damage to normal body tissues and important organs.

Pattern matching can effectively utilize the existing information in the data: similar patterns exhibit similar trends in the response. The pattern recognition process is enhanced by combining it with statistical and feature analysis, which helps to obtain better-matched patterns and to remove undesired ones. The experimental results show that the prediction of the proposed method is very accurate and robust across different kinds of patients. We compared the proposed pattern-based method with the current state-of-the-art methods and found that it outperforms all of them. It should contribute substantially to tumor position prediction and thereby help cancer patients to enhance their quality of life.
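A stripped-down sketch of the core idea, nearest-neighbor matching of the most recent window against the history with the forecast taken as the average continuation of the best matches, is given below. The window length, number of neighbors, Euclidean distance, and plain averaging are illustrative simplifications of the full framework of Chapter 3 (which also uses polynomial pattern approximation, pattern screening, and bootstrapping).

    import numpy as np

    def pattern_predict(series, window=40, horizon=10, k=5):
        """Forecast `horizon` steps ahead from the k best-matching past windows."""
        series = np.asarray(series, dtype=float)
        query = series[-window:]
        # candidate windows whose future of length `horizon` is fully observed
        n = len(series) - window - horizon + 1
        candidates = np.array([series[i:i + window] for i in range(n)])
        dists = np.linalg.norm(candidates - query, axis=1)
        best = np.argsort(dists)[:k]                       # k nearest patterns
        futures = [series[i + window:i + window + horizon] for i in best]
        return np.mean(futures, axis=0)                    # averaged continuation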

Chapter 4 presents a comprehensive study of EEG time series data mining for classifying the EEG signals of musicians and non-musicians. The objective of the study is to predict whether an EEG signal belongs to a musician or a non-musician. The EEG signals are first cleaned by ICA-based artifact removal and outlier epoch rejection. Then, features are extracted from the EEG signals using an extensive set of algorithms. The proximal support vector machine (PSVM) is computationally friendly and efficient, and its performance is usually satisfactory, so we use PSVM as the classifier in our study.
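For concreteness, a minimal sketch of the linear PSVM of Mangasarian and Wild is given below: training reduces to solving a single regularized linear system rather than a quadratic program, which is what makes the classifier so computationally friendly. The regularization parameter nu is an illustrative choice.

    import numpy as np

    def psvm_train(A, d, nu=1.0):
        """Linear proximal SVM. A: (m, n) feature matrix; d: labels in {-1, +1}.
        Solves (I/nu + E'E) z = E'e with E = D[A, -e] and z = [w; gamma]."""
        m, n = A.shape
        e = np.ones(m)
        E = d[:, None] * np.hstack([A, -e[:, None]])
        z = np.linalg.solve(np.eye(n + 1) / nu + E.T @ E, E.T @ e)
        return z[:-1], z[-1]                       # weights w, threshold gamma

    def psvm_predict(A, w, gamma):
        return np.sign(A @ w - gamma)              # classify by sign of A w - gamma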

To sum up, the method satisfactorily predicts the class of the subjects. The highest success rate is 97.30%, which occurs at condition 30 and epoch B1.

Artifact removal is a challenging task: we want to remove noisy signals while retaining the useful information. Much work has been done on this problem, but few approaches give satisfactory results. In our study, we apply ICA to decompose the signal into independent components (ICs) and then remove those considered to be artifact components before reconstructing the signal. The results in Chapter 4 show that our artifact removal significantly enhances the classification performance.
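The sketch below illustrates this decompose, zero out, reconstruct pattern using scikit-learn's FastICA; the choice of FastICA and a caller-supplied list of artifact components are assumptions for illustration (in practice, artifact ICs are identified by visual inspection or automated criteria).

    import numpy as np
    from sklearn.decomposition import FastICA

    def remove_artifact_components(eeg, bad_ics):
        """eeg: (n_samples, n_channels). Zero the listed ICs, then reconstruct."""
        ica = FastICA(n_components=eeg.shape[1], random_state=0)
        sources = ica.fit_transform(eeg)        # decompose into independent components
        sources[:, bad_ics] = 0.0               # remove components judged to be artifacts
        return ica.inverse_transform(sources)   # reconstructed, cleaned signal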

In the future, we will consider outlier removal techniques for epochs. Bad data exist in every EEG recording, with many possible causes, such as muscle movement or distraction of the participant during the experiment. The performance is expected to be enhanced by removing the contaminated epochs.

BIOGRAPHICAL STATEMENT

Jerry K.M. Kam joined the Department of Industrial & Manufacturing System

Engineering at UTA in the Fall of 2010. He received his B.S. degree in Industrial

Engineering & Engineering Management from City University of Hong Kong. He is

co-advised by Prof. Li Zeng and Prof. Shouyi Wang on his PhD study. Currently,

he is working with Prof. Wang on research problems in the field of time series data mining, including respiratory motion time series prediction, time series segmentation, and EEG signal classification.

