PREDICTION OF FINANCIAL TIME SERIES WITH HIDDEN MARKOV MODELS
by
Yingjian Zhang
B.Eng. Shandong University, China, 2001
a thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Applied Science
in the School
of
Computing Science
© Yingjian Zhang 2004
SIMON FRASER UNIVERSITY
May 2004
All rights reserved. This work may not be
reproduced in whole or in part, by photocopy
or other means, without the permission of the author.
APPROVAL
Name: Yingjian Zhang
Degree: Master of Applied Science
Title of thesis: Prediction of Financial Time Series with Hidden Markov Models
Examining Committee: Dr. Ghassan Hamarneh
Chair
Dr. Anoop Sarkar
Assistant Professor, Computing Science
Senior Supervisor
Dr. Andrey Pavlov
Assistant Professor, Business Administration
Co-Senior Supervisor
Dr. Oliver Schulte
Assistant Professor, Computing Science
Supervisor
Dr. Martin Ester
Associate Professor, Computing Science
SFU Examiner
Date Approved:
Abstract
In this thesis, we develop an extension of the Hidden Markov Model (HMM) that addresses
two of the most important challenges of financial time series modeling: nonstationarity and
nonlinearity. Specifically, we extend the HMM with a novel exponentially weighted
Expectation-Maximization (EM) algorithm to handle these two challenges. We show that
this extension allows the HMM algorithm to model not only sequence data but also dynamic
financial time series. We show that the update rules for the HMM parameters can be written
in the form of exponential moving averages of the model variables, so that we can take
advantage of existing technical analysis techniques. We further propose a double weighted
EM algorithm that is able to adjust training sensitivity automatically. Convergence results
for the proposed algorithms are proved using techniques from the EM Theorem.
Experimental results show that our models consistently beat the S&P 500 Index over
five 400-day testing periods from 1994 to 2002, including both bull and bear markets. Our
models also consistently outperform the top 5 S&P 500 mutual funds in terms of the Sharpe
Ratio.
To my mom
Acknowledgments
I would like to express my deep gratitude to my senior supervisor, Dr. Anoop Sarkar, for his
great patience, detailed instructions and insightful comments at every stage of this thesis.
What I have learned from him goes beyond the knowledge he taught me; his attitude toward
research, his carefulness and his commitment have had, and will continue to have, a profound
impact on my future academic activities.
I am also indebted to my co-senior supervisor, Dr. Andrey Pavlov, for trusting me before
I realized my results. I still remember the day when I first presented to him the "super five
day pattern" program. He is the source of the knowledge of finance that I used in this
thesis. Without the countless time he spent with me, this thesis surely would not have been
possible.
I cannot thank enough my supervisor, Dr. Oliver Schulte. Part of this thesis derives
from a course project I did for his machine learning course, in which he introduced me to
the world of machine learning and interdisciplinary research approaches.
Many of my fellow graduate students offered great help during my study and research at
the computing science school. I thank Wei Luo for helping me with many issues regarding
LaTeX and Unix; Cheng Lu and Chris Demwell for sharing their experiences with Hidden
Markov Models; and my labmates at the Natural Language Lab for their comments at my
practice talk and for creating such a nice environment in which to work.
I would also like to thank Mr. Daniel Brimm for his comments from the point of view
of a real-life business professional.
Finally I would like to thank my mom for all that she has done for me. Words cannot
express one millionth of my gratitude to her but time will prove the return of her immense
love.
Contents
Approval ii
Abstract iii
Dedication iv
Acknowledgments v
Contents vi
List of Tables ix
List of Figures x
1 Introduction 1
1.1 Financial Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Return of Financial Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 The Standard & Poor 500 data . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Traditional Techniques and Their Limitations . . . . . . . . . . . . . . . . . . 5
1.5 Non-efficiency of Financial Markets . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.8 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.9 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Hidden Markov Models 13
2.1 Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 A Markov chain of stock market movement . . . . . . . . . . . . . . . . . . . 14
2.3 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Why HMM? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 General form of an HMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Discrete and continuous observation probability distribution . . . . . . . . . . 20
2.6.1 Discrete observation probability distribution . . . . . . . . . . . . . . . 20
2.6.2 Continuous observation probability distribution . . . . . . . . . . . . . 21
2.7 Three Fundamental Problems for HMMs . . . . . . . . . . . . . . . . . . . . . 22
2.7.1 Problem 1: Finding the probability of an observation . . . . . . . . . 23
2.7.2 Problem 2: Finding the best state sequence . . . . . . . . . . . . . . . 27
2.7.3 Problem 3: Parameter estimation . . . . . . . . . . . . . . . . . . . . . 29
2.8 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 HMM for Financial Time Series Analysis 32
3.1 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Baum-Welch Re-estimation for Gaussian mixtures . . . . . . . . . . . . . 34
3.3 Multivariate Gaussian Mixture as Observation Density Function . . . . . . . 37
3.3.1 Multivariate Gaussian distributions . . . . . . . . . . . . . . . . . . . . 38
3.3.2 EM for multivariate Gaussian mixtures . . . . . . . . . . . . . . . . . 38
3.4 Multiple Observation Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Dynamic Training Pool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.6 Making Predictions with HMM . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6.1 Setup of HMM Parameters . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6.2 Making predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 Exponentially Weighted EM Algorithm for HMM 46
4.1 Maximum-likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 The Traditional EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Rewriting the Q Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 Weighted EM Algorithm for Gaussian Mixtures . . . . . . . . . . . . . . . . . 50
4.5 HMGM Parameter Re-estimation with the Weighted EM Algorithm . . . . . 54
4.6 Calculation and Properties of the Exponential Moving Average . . . . . . . . 55
4.7 HMGM Parameter Re-estimation in a Form of EMA . . . . . . . . . . . . . 58
4.8 The EM Theorem and Convergence Property . . . . . . . . . . . . . . . . . . 61
4.9 The Double Weighted EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . 63
4.9.1 Problem with the Exponentially Weighted Algorithm . . . . . . . . . . 63
4.9.2 Double Exponentially Weighted Algorithm . . . . . . . . . . . . . . . 63
4.10 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5 Experimental Results 65
5.1 Experiments with Synthetic Data Set . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Recurrent Neural Network Results . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3 HMGM Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.4 Test on Data Sets from Different Time Periods . . . . . . . . . . . . . . . . 71
5.5 Comparing the Sharpe Ratio with the Top Five S&P 500 Mutual Funds . . . 72
5.6 Comparisons with Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . 79
6 Conclusion and Future Work 81
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
A Terminology List 84
Bibliography 86
List of Tables
1.1 Data format of the experiment (for the segment from Jan 2, 2002 to Jan 7, 2002) . . 5
1.2 Notation list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1 State transition probability matrix of the HMM for stock price forecast. As the
market moves from one day to another, the professional might change his strategy
without letting others know . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.1 Comparison of the recurrent network and the exponentially weighted HMGM on the
synthetic data test. The EW-HMGM outperforms the recurrent network in overall
results and has lower volatility. . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 Comparison of the Sharpe Ratio between the HMM and the top 5 S&P 500 funds
during the period of August 5, 1999 to March 7, 2001. Four of the five funds have
negative Sharpe Ratios. The absolute value of the positive one is much smaller than
that of the HMGM prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 Comparison of the Sharpe Ratios during the period of March 8, 2001 to October 11,
2002. The Double weighted HMGM consistently outperforms the top ﬁve S&P 500
mutual funds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
List of Figures
1.1 The return curve of a prediction system. $1 input at the beginning of the trading
period generates a return of $2.1 while the index drops below 1. . . . . . . . . . 4
1.2 A Super Five Day Pattern in the NASDAQ Composite Index from 1984 to 1990.
The first three days and the last two days tend to have higher returns than the rest
of the trading days. However, the pattern diminished in the early 1990s. . . . . . 8
2.1 Modeling the possible movement directions of a stock as a Markov chain. Each of
the nodes in the graph represents a move. As time goes by, the model moves from
one state to another, just as the stock price fluctuates every day. . . . . . . . . . . 15
2.2 The figure shows the set of strategies that a financial professional can take. The
strategies are modeled as states. Each strategy is associated with a probability of
generating the possible stock price movement. These probabilities are hidden from
the public; only the professional himself knows them. . . . . . . . . . . . . . . . 17
2.3 The trellis of the HMM. The states are associated with diﬀerent strategies and are
connected to each other. Each state produces an observation picked from the vector
[LR, SR, UC, SD, LD]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 The forward procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5 The backward procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Finding best path with the Viterbi algorithm . . . . . . . . . . . . . . . . . . . . 29
3.1 Three single Gaussian density functions. Each of them contributes a percentage
to the combined output in the mixture. . . . . . . . . . . . . . . . . . . . . . . 35
3.2 A Gaussian mixture derived from the three Gaussian densities above. For each
small area δ close to each point x on the x axis, there is a 20% probability that
the random variable x is governed by the first Gaussian density function, a 30%
probability that the distribution of x is governed by the second Gaussian, and a 50%
probability that it is governed by the third Gaussian. . . . . . . . . . . . . . . . 36
3.3 The dynamic training pool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Hidden Markov Gaussian Mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1 When the 30-day moving average moves above the 100-day moving average, the trend
is considered bullish. When the 30-day moving average declines below the 100-day
moving average, the trend is considered bearish. (picture from stockcharts.com) . . . 56
4.2 The portion of each of the ﬁrst 19 days’ EMA in calculating the EMA for the 20th
day. As is shown in the ﬁgure, the EMA at the 19th day accounts for 2/3 of the
EMA at day 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 The portion of each of the 20 days' actual data in calculating the EMA for the 20th
day. The last day accounts for 1/3 of the EMA at day 20. The data earlier than day
10 take up a very small portion but are still included in the overall calculation. . . . 59
4.4 The prediction falls behind the market during highly volatile periods of time. This
is because the system is too sensitive to recent changes. . . . . . . . . . . . . . . . 63
5.1 Experimental results on computer generated data. . . . . . . . . . . . . . . . . . . 66
5.2 Prediction of the NASDAQ Index in 1998 with a recurrent neural network. . . . . . 67
5.3 Prediction of the NASDAQ Index in 1999 with a recurrent neural network. . . . . . 68
5.4 Prediction of the NASDAQ Index in 2000 with a recurrent neural network. . . . . . 68
5.5 Prediction of the 400-day S&P 500 Index starting from Jan 2, 1998. Training
window size = 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.6 Prediction of the 400-day S&P 500 Index starting from Jan 2, 1998. Training
window size = 20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.7 Prediction of the 400-day S&P 500 Index starting from Jan 2, 1998. Training
window size = 40 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.8 Prediction of the 400-day S&P 500 Index starting from Jan 2, 1998. Training
window size = 100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.9 Prediction with HMGM for 400 days of the S&P 500 Index. Note that from t = 213
to t = 228, the HMM was not able to catch up with the change in trend swiftly. . . 73
5.10 Prediction with multivariate HMGM for 400 days of the S&P 500 Index. . . . . . . 73
5.11 Prediction with single exponentially weighted HMGM . . . . . . . . . . . . . . . . 74
5.12 Prediction with double weighted HMGM that has a conﬁdence adjustment variable. 74
5.13 Results from all models for comparison. . . . . . . . . . . . . . . . . . . . . . . . 75
5.14 Results on the period from November 2, 1994 to June 3, 1996 . . . . . . . . . . . . 75
5.15 Results on the period from June 4, 1996 to December 31, 1998 . . . . . . . . . . . 76
5.16 Results on the period from August 5, 1999 to March 7, 2001 . . . . . . . . . . . . . 76
5.17 Results on the period from March 8, 2001 to October 11, 2002 . . . . . . . . . . . 77
5.18 Too many positive signals result in a curve similar to the index. . . . . . . . . . . . 80
Chapter 1
Introduction
This thesis explores a novel model in the interdisciplinary area of computational finance.
Similar to traditional financial engineering, we build models to analyze and predict financial
time series. What distinguishes our work from previous work is that, rather than building
models based on regression equations, we apply a machine learning model, the Hidden Markov
Model, to financial time series. This chapter describes the basic concepts of financial time
series, the difficulties of accurate financial time series prediction, the limitations of regression
models, and the motivation and objective of our machine learning model. The organization
of the thesis is also outlined in the later part of this chapter.
1.1 Financial Time Series
Financial time series data are a sequence of prices of some financial asset over a specific
period of time. In this thesis we will focus our research on the most volatile and challenging
financial market: the stock market. More specifically, our experimental data include the
daily data of the NASDAQ Composite Index, the Dow Jones Industrial Index and the S&P
500 Index.
Over the past 50 years, there has been a large amount of intensive research on the
stock market. Financial analysts carry out many kinds of research in the hope of
accomplishing one goal: to beat the market. To beat the market means to have a rate
of return that is consistently higher than the average return of the market while keeping
the same level of risk as the market. In this thesis, the average return of the market is
measured by the Dow Jones Industrial Index, which is a weighted average of 30 significant
stocks, and the S&P 500 Index, which is a combination of 500 stocks. Generally there are
two main streams in financial market analysis: fundamental analysis and technical analysis.
Fundamental analysis is the examination of the underlying forces that affect the well-being
of the economy, industry sectors, and individual companies. For example, to forecast
future prices of an individual stock, fundamental analysis combines economic, industry, and
company analyses to derive the stock's current fair value and forecast its future value. If the
fair value is not equal to the current stock price, fundamental analysts believe that the stock
is either over- or undervalued and that the market price will ultimately gravitate toward fair
value.
There are some inherent weaknesses in fundamental analysis. First, it is very time-consuming:
by the time fundamental analysts collect information about the health of the company or the
economy and give their prediction of the future price, the current price has already changed
to reflect the most up-to-date information. Second, there is an inevitable analyst bias and
accompanying subjectivity.
Technical analysis is the examination of past price movements to forecast future price
movements. In our experiments we base our forecast of the daily stock price on past
daily prices.
The theoretical basis of technical analysis is that price reflects all the relevant information,
price movements are not totally random, and history repeats itself. All of these are derived
from the belief in the non-efficiency of financial markets. We will discuss the Efficient Market
Hypothesis and the evidence of non-efficiency in the market later in this chapter.
We will focus only on methods of technical analysis in this thesis.
In this thesis, we use the terms forecast and prediction interchangeably to represent the
process of generating unseen data in a financial time series.
1.2 Return of Financial Time Series
One of the most important indicators for measuring the performance of mutual funds or
prediction systems is the return of the system. Usually we have a benchmark series to
compare against. In our experiments we use the S&P 500 Index as the benchmark.
Let P_t be the price of an asset at time index t. The one-period simple net return R_t of
a financial time series is calculated with the following formula:

    R_t = (P_t − P_{t−1}) / P_{t−1}                                        (1.1)
For a k-period financial time series, the return at the end of the period is a cumulative
return composed of all the periods:

    1 + R_t[k] = (1 + R_1)(1 + R_2) · · · (1 + R_k) = ∏_{t=1}^{k} (1 + R_t)        (1.2)
In developing a prediction system, our goal is to make as many accurate predictions
as possible. This results in a return curve for the prediction system. The higher the
prediction accuracy, the higher the return curve will be above the index curve. Figure 1.1
shows the return based on a strategy using some prediction system. If we invest 1 dollar at
the beginning of the period, we get $2.1 at the end of the period.
If one achieved 100% accuracy for any 400 consecutive days in the period from 1987
to 1994, the return at the end of the period would be roughly 25 times the initial input.
This is the upper bound for the return. Normally we care more about the return than about
the accuracy rate of the prediction, because the simple net return on each day is different. For
example, two days of accurate predictions of small changes may not offset the loss from one
day's incorrect prediction of a large change in the market.
Prediction of financial time series is a very extensive topic. In this thesis, we only predict
the direction of the time series, generating a signal of either up or down for each trading
day.
1.3 The Standard & Poor 500 data
The real-world financial time series we input into the HMM prediction system is the daily
return of the Standard & Poor 500 Index (S&P 500). The S&P 500 Index is a weighted
average of the stock prices of 500 leading companies in leading industry sectors of the US.
The index reflects the total market value of all component stocks relative to a particular
period of time. The S&P 500 Index is viewed as one of the best indicators of the health of
the US economy. Unlike the Dow Jones Industrial Average, which is an average of the 30
component stock prices, the S&P 500 represents the market value of the 500 component stocks. To
Figure 1.1: The return curve of a prediction system. $1 input at the beginning of the trading period
generates a return of $2.1 while the index drops below 1.
calculate the S&P 500 Index, we first add up the total market value of all the 500 stocks,
then divide it by the total number of outstanding shares in the market. Because it looks
at the total market value and includes stocks from different industries than the Dow, the
S&P 500 has some characteristics different from those of the Dow.
In some scenarios of the experiments we also include the Dow Jones Index spanning
the same time period as the S&P 500 Index. One reason to use another time series
to facilitate the prediction is that there are sideways movements in long-term trends,
also called whipsaws (see Section 3.1). These are the reverse signals in a long-term trend. We
need to identify the real trend change signals from the noisy whipsaws. The Dow is a less
sensitive index than the S&P 500, so we use it to help smooth the prediction.
The return of the S&P 500 Index for each day is calculated using the techniques
introduced in Section 1.2. The price we use for the calculation is the closing price of
each day. In some experiments we use a second sequence of data: the volume. An example
of the data format is shown in Table 1.1:
Date          Close      Return     Volume
Jan 2, 2002   1154.67    0.00574    1418169984
Jan 3, 2002   1165.27    0.00918    1874700032
Jan 4, 2002   1172.51    0.006213   1943559936
Jan 7, 2002   1164.89   −0.0065     1682759936

Table 1.1: Data format of the experiment (for the segment from Jan 2, 2002 to Jan 7, 2002)
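As a sanity check, the Return column of Table 1.1 follows directly from equation (1.1) applied to successive closing prices; a small sketch over the closes in the table (the Jan 7 return is negative because the index fell from the previous close):

```python
# Closing prices taken from Table 1.1.
closes = {"Jan 2": 1154.67, "Jan 3": 1165.27, "Jan 4": 1172.51, "Jan 7": 1164.89}

def net_return(prev_close, close):
    """Equation (1.1): one-period simple net return."""
    return (close - prev_close) / prev_close

r_jan3 = net_return(closes["Jan 2"], closes["Jan 3"])  # ~ 0.00918
r_jan4 = net_return(closes["Jan 3"], closes["Jan 4"])  # ~ 0.006213
r_jan7 = net_return(closes["Jan 4"], closes["Jan 7"])  # ~ -0.0065
```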
1.4 Traditional Techniques and Their Limitations
Traditionally, people use various forms of statistical models to predict volatile ﬁnancial
time series [GP86]. Sometimes stocks move without any apparent news, as the Dow Jones
Industrial Index did on Monday, October 19, 1987. It fell by 22.6 percent without any
obvious reason. We have to identify whether an irrational price movement is noise or is
part of a pattern. Perhaps the most prominent of all these models is the BlackScholes
model for stock option pricing [BS73] [Hul00]. The BlackScholes model is a statistical
model for pricing the stock option at a future time based on a few parameters. Most of
these statistical models require one assumption: the form of population distribution such
as Gaussian distributions for stock markets or binomial distribution for bonds and debts
markets [MCG04].
Before we go on to the machine learning solution to the problem, let us take a short review
of the econometric method. Classical econometrics has a large literature on the time series
forecasting problem. The basic idea is as follows:
To make things simple, assume the time series follows a linear pattern. We then find a
linear regression predicting the target value from the previous value, y_t = a + b × y_{t−1} + noise,
where a and b are regression weights, and work out the weights given some form of noise. This
method is rather crude but may work for deterministic linear trends. For nonlinear trends,
one needs to use a nonlinear regression such as y_t = a × y_{t−1} + b × y²_{t−1}. But sometimes this
nonlinear formulation is very hard to find.
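The linear case above amounts to an ordinary least-squares fit of y_t against y_{t−1}. A minimal sketch on synthetic data (the function name and test series are illustrative, not from the thesis):

```python
def fit_ar1(series):
    """Least-squares fit of y_t = a + b * y_{t-1} over consecutive pairs."""
    x = series[:-1]   # lagged values y_{t-1}
    y = series[1:]    # targets y_t
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    b = cov / var
    a = my - b * mx
    return a, b

# A noise-free series generated by y_t = 2 + 0.5 * y_{t-1}: the fit recovers a and b.
series = [1.0]
for _ in range(20):
    series.append(2.0 + 0.5 * series[-1])
a, b = fit_ar1(series)   # a ~ 2.0, b ~ 0.5
```

On real price data the noise term dominates, which is exactly why this crude method fails for anything but deterministic trends.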
The econometric method usually proposes a statistical model based on a set of parameters.
As these models are independent of past prices, some of the parameters must be
estimated from other current conditions. The process of estimating these parameters
brings in human analyst bias. For example, in the Black-Scholes option pricing model,
the no-arbitrage¹ option cost is calculated as:

    C = C(s, t, K, σ, r)
      = e^{−rt} E[(se^W − K)^+]
      = E[(se^{W−rt} − Ke^{−rt})^+]
      = E[(se^{−σ²t/2 + σ√t Z} − Ke^{−rt})^+]                              (1.3)

where Z is a standard normal random variable.
Here, s is the security's initial price, t is the exercise time of the option, K is the strike price,
and r is the interest rate. These four parameters are fixed and can be easily identified. W is a
normal random variable with mean (r − σ²/2)t and variance σ²t. The only way to estimate it
is to pick a segment of past price data and assume that the logarithm of this data follows the
normal distribution. The choice of the price segment, and the assumption that the logarithm
of the prices follows the normal distribution, are rather subjective.
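Equation (1.3) can be evaluated either by Monte Carlo simulation over Z or by the closed-form Black-Scholes formula; the sketch below (with illustrative parameter values, not taken from the thesis) shows the two agree:

```python
import math, random

def black_scholes_call(s, t, K, sigma, r):
    """Closed-form value of C(s, t, K, sigma, r), equivalent to equation (1.3)."""
    d1 = (math.log(s / K) + (r + sigma ** 2 / 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # std normal CDF
    return s * Phi(d1) - K * math.exp(-r * t) * Phi(d2)

def monte_carlo_call(s, t, K, sigma, r, n=200_000, seed=0):
    """Direct Monte Carlo estimate of E[(s e^{-sigma^2 t/2 + sigma sqrt(t) Z} - K e^{-rt})^+]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        payoff = (s * math.exp(-sigma ** 2 * t / 2 + sigma * math.sqrt(t) * z)
                  - K * math.exp(-r * t))
        total += max(payoff, 0.0)
    return total / n

c_exact = black_scholes_call(100, 1.0, 100, 0.2, 0.05)   # ~ 10.45
c_mc = monte_carlo_call(100, 1.0, 100, 0.2, 0.05)        # close to c_exact
```

Note that σ is the one input the market does not hand us directly, which is where the subjective choice of historical price segment enters.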
Another inherent problem of statistical methods is that sometimes there are price
movements that cannot be modeled with one single probability density function (pdf). In
[Ros03], we see an experiment on crude oil data. The example shows that, rather than just one
function, there are four distributions that determine the difference between the logarithm
of tomorrow's price and the logarithm of today's price. These four pdfs must work together,
each of them providing a contribution at different time points.
Finally, there are times when we cannot find the best approximation function. This can
be caused by rapid change in the data patterns. Different time windows exhibit different
patterns and thus cannot be described with one stationary model. It is generally agreed
that no single model can make accurate predictions all the time.
1.5 Non-efficiency of Financial Markets
In a classic statement of the Efficient Market Hypothesis (EMH), Fama [Fam65] [Fam70]
defined an efficient financial market as one in which security prices always fully reflect the
available information. The EMH "rules out the possibility of trading systems based only on
currently available information that have expected profits or returns in excess of equilibrium
expected profit or return" [Fam70]. In plain English, an average investor cannot hope to

¹ Arbitrage is the simultaneous buying and selling of securities to take advantage of price discrepancies.
Arbitrage opportunities usually surface after a takeover offer.
consistently beat the market, and the best prediction of a future price is the current price.
This is called a random walk of the market. If equal performance is obtained on random
walk data as compared to the real data, then no significant predictability has been found.
The EMH was met with harsh challenges, both theoretical and empirical, shortly after it
was proposed.
De Bondt and Thaler [BWT85] compare the performance of two groups of companies: extreme
losers and extreme winners. They study the stock prices of the two groups of companies from
1933 to 1985. They find that the portfolios of the winners chosen over the previous three years
have relatively lower returns than the losers' portfolio over the following five years. A more
prominent example is the so-called January Effect [Wac42] [Tha87]: for small companies,
there are superior returns around the end of the year and in early January compared with
other times of the year. This phenomenon was observed many years after Wachtel's initial
observation and discussion. Interestingly, the January Effect has disappeared in the last 15
years. But this does not mean the market is fully efficient, as the process took a long time
and other new patterns may have evolved that are yet to be discovered.
Our own program also finds some interesting patterns. There is a super five day pattern
in the NASDAQ Composite Index from 1984 to 1990: the first three days and the last two
days of each month generate a higher return in the index than any other day of the month.
The shape of the return curve thus looks like a parabola. After 1990, the pattern became
more random and less recognizable.
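A pattern of this kind can be checked mechanically: group each month's daily returns by their position in the month and compare the average return of the first three and last two trading days against the rest. A small sketch on synthetic data (the helper below is hypothetical and is not the program used in the thesis):

```python
def edge_vs_middle(monthly_returns):
    """Average return of the first 3 and last 2 trading days of each month
    versus the average return of the remaining days."""
    edge, middle = [], []
    for month in monthly_returns:          # each month: a list of daily returns
        for pos, r in enumerate(month):
            is_edge = pos < 3 or pos >= len(month) - 2
            (edge if is_edge else middle).append(r)
    return sum(edge) / len(edge), sum(middle) / len(middle)

# Twelve synthetic months exhibiting the pattern: higher returns at the edges,
# giving the parabola-like return profile described above.
months = [[0.01, 0.01, 0.01] + [-0.002] * 15 + [0.01, 0.01] for _ in range(12)]
edge_avg, middle_avg = edge_vs_middle(months)   # edge_avg > middle_avg
```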
1.6 Objective
Financial time series consist of multidimensional and complex nonlinear data that result
in many pitfalls and difficulties for accurate prediction. We believe that a successful financial
time series prediction system should be hybrid and custom-made, which means it requires
a perfect combination of sophistication in the prediction system design and insight in the
input data preprocessing.
Both areas have been extensively researched. Data mining researchers have developed
some very useful techniques for dealing with data, such as handling missing values [HK01] and
data multiplication [WG93]. Traders and technical analysts have also done a great deal of work.
They derive enhanced features of interest from the raw time series data, such as a variety of
indicators including Moving Average Convergence Divergence (MACD), Price by Volume,
Figure 1.2: A Super Five Day Pattern in the NASDAQ Composite Index from 1984 to 1990. The
first three days and the last two days tend to have higher returns than the rest of the trading days.
However, the pattern diminished in the early 1990s.
Relative Strength Index (RSI), etc., some of which have been confirmed to contain useful
information [STW99].
On the other hand, there have been several straightforward approaches that apply a
single algorithm to relatively unprocessed data, such as the Back-Propagation Network [CWL96],
Nearest Neighbor [Mit97], the Genetic Algorithm [Deb94], Hidden Markov Models (HMMs) [SW97]
and so on.
There are three major difficulties in the accurate forecasting of financial time series. First,
the patterns of well-known financial time series are dynamic, i.e., there is no single fixed
model that works all the time. Second, it is hard to balance long-term trends against
short-term sideways movements during the training process. In other words, an efficient
system must be able to adjust its sensitivity as time goes by. Third, it is usually hard
to determine the usefulness of information. Misleading information must be identified and
eliminated.
This thesis aims at solving all three of these problems with a Hidden Markov Model (HMM)
based forecasting system. The HMM has a Gaussian mixture at each state as the forecast
generator. Gaussian mixtures have the desirable feature that they can model series that
do not follow a single Gaussian distribution. The parameters of the HMM are then
updated at each iteration with an add-drop Expectation-Maximization (EM) algorithm. At
each time point, the parameters of the Gaussian mixtures, the weight of each Gaussian
component in the output, and the transition matrix are all updated dynamically to
cater to the nonstationary financial time series data.
We then develop a novel exponentially weighted EM algorithm that solves the second
and third difficulties at one stroke. The revised EM algorithm gives more weight to the
most recent data while at the same time adjusting the weight according to the change in
volatility. The volatility is calculated as the divergence of the S&P 500 Index from the Dow
Jones Composite Index. It is included as new information that helps make better judgments.
A balance between long-term and short-term series is then found throughout the combined
training-forecast process.
1.7 Contributions of the Thesis
This thesis is an interdisciplinary work that combines computer science and financial engineering. Techniques from the two areas are interwoven. The major contributions are:
1. There are few applications of Hidden Markov Models with Gaussian mixtures to time-dependent sequence data. This is one of the first large-scale applications to financial time series.

2. We develop a novel dynamic training scheme. The concept of the dynamically updated training pool enables the system to absorb the newest information from the data.

3. We develop an exponentially weighted EM algorithm. This algorithm recognizes that recent data should be given more attention and enables us to detect trend changes faster than the traditional EM algorithm, especially when the training data set is relatively large.

4. Based on the exponentially weighted EM algorithm, we propose a double weighted EM algorithm. The double weighted EM algorithm addresses the drawback of the single exponentially weighted EM algorithm caused by over-sensitivity.

5. We show that the weighted EM algorithms can be written in the form of exponential moving averages of the model variables of the HMM. This allows us to take advantage of this existing econometrics technique in generating the EM-based re-estimation equations.

6. Using the techniques from the EM Theorem, we prove that the convergence of the two proposed algorithms is guaranteed.

7. Experimental results show our model consistently outperforms the index fund in returns over five 400-day testing periods from 1994 to 2002.

8. The proposed model also consistently outperforms top-listed S&P 500 mutual funds on the Sharpe Ratio.
1.8 Thesis Organization
Chapter 2 of this thesis introduces the general form of HMMs. Chapter 3 describes an HMM with Gaussian mixtures for the observation probability density functions and derives the parameter re-estimation formulas based on the classic EM algorithm. Then in chapter 5, some detailed aspects of the HMM are studied, including multiple observation sequences and multivariate Gaussian mixtures as pdfs. We also derive our revised exponentially weighted EM algorithm and the new parameter updating rules in this chapter. Chapter 6 shows the experimental results and concludes the thesis. Future work that indicates possible improvements of the model is also included in chapter 6.
1.9 Notation
We list the Roman and Greek letters used in this thesis for reference. These symbols are used in the derivations of the re-estimation formulas and proofs in chapters 2, 3 and 4.
Symbol      Notation
s_t         The state at time t
π_i         Probability of being at state i at time 0
a_ij        Probability of moving from state i to state j
b_j(o_t)    Probability of being at state j and observing o_t
w_jk        Mixture weight for the k-th component at state j
O           The observation sequence
o_t         The t-th observation
α_i(t)      Forward variable in HMM
β_i(t)      Backward variable in HMM
p_t(i, j)   Prob. of being at state i at time t, and state j at t + 1
γ_i(t)      Prob. of being at state i at time t
δ_i(t)      Prob. of the best sequence ending in state i at time t
δ_im(t)     Prob. of the best sequence ending at the m-th component of the i-th state
ψ_t         The node of the incoming arc that leads to the best path
X           Observed data
Y           Hidden part governed by a stochastic process
Z           Complete data set
η           Exponential weight imposed on intermediate variables in EM
ζ           Self-adjustable weight in double weighted EM
ρ           The multiplier in calculating EMA
δ_im(T)     The EMA of δ_im at time T
δ_i(T)      The EMA of δ_i at time T
θ_j         Parameters of the j-th component in a Gaussian mixture
Θ           Parameters of a Gaussian mixture
L           The likelihood function
Table 1.2: Notation list
Chapter 2
Hidden Markov Models
This chapter describes the general Markov process and Hidden Markov Models (HMMs). We start by introducing the properties of the Markov process, then the characteristics of HMMs. [RJ86] and [Rab89] provide a thorough introduction to the Markov process, Hidden Markov Models and the main problems involved with them. These discussions form the basis of our modeling and experiments with financial time series data.
2.1 Markov Models
Markov models are used to model sequential data, such as speech utterances, temperature variations, biological sequences, and other sequence data. In a Markov model, each observation in the data sequence depends on previous elements in the sequence. Consider a system with a set of distinct states, S = {1, 2, . . . , N}. At each discrete time slot t, the system moves to one of the states according to a set of state transition probabilities P. We denote the state at time t as s_t.
In many cases, the prediction of the next state and its associated observation only
depends on the current state, meaning that the state transition probabilities do not depend
on the whole history of the past process. This is called a ﬁrst order Markov process. For
example, knowing the number of students admitted this year might be adequate to predict
next year’s admission. There is no need to look into the admission rate of the previous
years. These are the Markov properties described in [MS99]:

P(X_{t+1} = s_k | X_1, . . . , X_t) = P(X_{t+1} = s_k | X_t)   (Limited horizon)   (2.1)
                                   = P(X_2 = s_k | X_1)        (Time invariant)   (2.2)
Because the state transitions are independent of time, we have the following state transition matrix A:

a_ij = P(X_{t+1} = s_j | X_t = s_i)   (2.3)

a_ij is a probability, hence:

a_ij ≥ 0, ∀i, j;   Σ_{j=1}^N a_ij = 1.   (2.4)
Also we need to know the probability of starting from a certain state, the initial state distribution:

π_i = P(X_1 = s_i)   (2.5)

Here, Σ_{i=1}^N π_i = 1.
In a visible Markov model, the states from which the observations are produced and the
probabilistic functions are known so we can regard the state sequence as the output.
2.2 A Markov chain of stock market movement
In this section we use a Markov chain to represent stock movement. We present a five-state model of the movement of a certain stock (see Figure 2.1).
In the figure, nodes 1, 2, 3, 4, 5 represent a possible movement vector {large rise, small rise, no change, small drop, large drop}.
From the model in the figure we are able to answer some interesting questions about how this stock behaves over time. For example: What is the probability of seeing "small drop, no change, small rise, no change, no change, large rise" in six consecutive days?
First we can define the state sequence as X = {1, 2, 3, 2, 2, 4}. Given the Markov chain, the probability of seeing the above sequence, P(X | A, π), can be calculated as:

P(X | A, π) = P(1, 2, 3, 2, 2, 4 | A, π)   (2.6)
            = P(1)P(2 | 1)P(3 | 2)P(2 | 3)P(2 | 2)P(4 | 2)   (2.7)
            = π_1 · a_12 · a_23 · a_32 · a_22 · a_24   (2.8)
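The chain-rule computation in equations (2.6)-(2.8) can be sketched in a few lines of Python. The transition matrix and initial distribution below are made-up placeholders (every probability set to 0.2), not the values from Figure 2.1.

```python
import numpy as np

def markov_sequence_prob(states, A, pi):
    """P(X | A, pi) = pi[x_1] * prod_t a[x_t, x_{t+1}] for a first-order Markov chain."""
    p = pi[states[0]]
    for s, s_next in zip(states, states[1:]):
        p *= A[s, s_next]
    return p

# Hypothetical 5-state chain: uniform rows (each entry 0.2) so every row sums to 1.
A = np.full((5, 5), 0.2)
pi = np.full(5, 0.2)

# The six-day sequence X = {1, 2, 3, 2, 2, 4} from the text, as 0-based indices.
X = [0, 1, 2, 1, 1, 3]
p = markov_sequence_prob(X, A, pi)   # pi_1 * a_12 * a_23 * a_32 * a_22 * a_24
```

With uniform transition probabilities the result is simply 0.2^6; with a fitted matrix the same function evaluates equation (2.11).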
(Figure 2.1 appears here: a five-node graph in which each node carries outgoing arcs labeled lr/0.100, sr/0.200, uc/0.400, sd/0.200 and ld/0.100.)
Figure 2.1: Modeling the possible movement directions of a stock as a Markov chain. Each of the nodes in the graph represents a move. As time goes by, the model moves from one state to another, just as the stock price fluctuates every day.
More generally, the probability of a state sequence X_1, . . . , X_T can be calculated as the product of the transition probabilities:

P(X | A, π) = P(X_1)P(X_2 | X_1)P(X_3 | X_1, X_2) · · · P(X_T | X_1, . . . , X_{T−1})   (2.9)
            = P(X_1)P(X_2 | X_1)P(X_3 | X_2) · · · P(X_T | X_{T−1})   (2.10)
            = π_{X_1} ∏_{t=1}^{T−1} a_{X_t X_{t+1}}   (2.11)
2.3 Hidden Markov Models
The Markov model described in the previous section has limited power in many applications.
Therefore we extend it to a model with greater representation power, the Hidden Markov
Model (HMM). In an HMM, one does not know anything about what generates the observation
sequence. The number of states, the transition probabilities, and from which state an
observation is generated are all unknown.
Instead of combining each state with a deterministic output (such as large rise, small drop, etc.), each state of the HMM is associated with a probabilistic function. At time t, an observation o_t is generated by a probabilistic function b_j(o_t), which is associated with state j, with the probability:

b_j(o_t) = P(o_t | X_t = j)   (2.12)
Figure 2.2 shows an example of an HMM that models an investment professional's prediction about the market based on a set of investment strategies. Assume the professional has five investment strategies, each of which gives a distinctive prediction about the market movement (large drop, small drop, no change, small rise, and large rise) on a certain day. Also assume he performs market prediction every day with only one strategy.
Now how does the professional make his prediction? What is the probability of seeing the prediction sequence {small drop, small rise} if the prediction starts by taking strategy 2 on the first day?
To answer the first question, think of the HMM as a financial expert. Each of the five strategies is a state in the HMM. The movement set has all the movement symbols and each symbol is associated with one state. Each strategy has a different probability distribution for all the movement symbols. Each state is connected to all the other states with a transition probability distribution; see Figure 2.3 and Table 2.1.
Figure 2.2: The figure shows the set of strategies that a financial professional can take. The strategies are modeled as states. Each strategy is associated with a probability of generating the possible stock price movement. These probabilities are hidden to the public; only the professional himself knows them.
Unlike the Markov chain of the previous section, the HMM picks the particular strategy that follows the one previously taken based also on the observation probability. The key to the HMM's performance is that it can pick the best overall sequence of strategies based on an observation sequence. Introducing density functions at the states of the HMM gives more representation power than fixed strategies associated with the states.
Figure 2.3: The trellis of the HMM. The states are associated with diﬀerent strategies and are
connected to each other. Each state produces an observation picked from the vector [LR, SR, UC,
SD, LD].
The steps for generating the prediction sequence:
Probability Strategy1 Strategy2 Strategy3 Strategy4 Strategy5
Strategy1 10% 10% 20% 30% 30%
Strategy2 10% 20% 30% 20% 20%
Strategy3 20% 30% 10% 20% 20%
Strategy4 30% 20% 20% 15% 15%
Strategy5 30% 20% 20% 15% 15%
Table 2.1: State transition probability matrix of the HMM for stock price forecast. As the market moves from one day to another, the professional might change his strategy without letting others know.
1. Choose an initial state (i.e. a strategy) X_1 = i according to the initial state distribution Π.

2. Set the time step t = 1, since we are at the first day of the sequence.

3. Get the prediction according to the strategy of the current state i, i.e. b_i(o_t). Then we get the prediction for the current day. For example, the probability of seeing a large rise prediction on the first day if we take strategy 2 is 15%.

4. Transit to a new state X_{t+1}, according to the state transition probability distribution.

5. Set t = t + 1; go to step 3 if t ≤ T, otherwise terminate.
To answer the second question, we sum up all the probabilities with which the movement sequence may happen, through the different state sequences. We know the prediction starts with strategy 2, which means the initial state is state 2. Then there are five possible states for the second day. So the probability of seeing the movement sequence {small drop, small rise} is:

0.15 + 0.1 × 0.4 + 0.2 × 0.3 + 0.3 × 0.05 + 0.2 × 0.3 + 0.2 × 0.2 = 0.365
2.4 Why HMM?
From the previous example, we can see that there is a good match between sequential data analysis and the HMM, especially in the case of financial time series analysis and prediction, where the data can be thought of as generated by some underlying stochastic process that is not known to the public. Just like in the above example, if the professional's strategies and the number of his strategies are unknown, we can take the imaginary professional as the underlying power that drives the financial market up and down. As shown in the example, the HMM has more representative power than the Markov chain: the transition probabilities as well as the observation generation probability density function are both adjustable.
In fact, what drives the financial market and what pattern financial time series follow have long been an interest that attracts economists, mathematicians and, most recently, computer scientists. See [ADB98].
Besides the match between the properties of the HMM and the time series data, the Expectation-Maximization (EM) algorithm provides an efficient class of training methods [BP66]. Given plenty of data that are generated by some hidden power, we can create an HMM architecture, and the EM algorithm allows us to find the best model parameters that account for the observed training data.
The general HMM approach is an unsupervised learning framework which allows new patterns to be found. It can handle variable-length input sequences, without having to impose a "template" to learn. In later chapters we show a revised EM algorithm that gives more weight to recent training data while trying to balance the conflicting requirements of sensitivity and accuracy.
HMMs have been applied to many areas. They are widely used in speech recognition [Rab89][Cha93], bioinformatics [PBM94], and various time series analyses, such as weather data and semiconductor malfunction [GS00][HG94]. Weigend and Shi [SW97] have previously applied HMMs to financial time series analysis. We compare our approach to theirs in Chapter 5.
2.5 General form of an HMM
An HMM is composed of a five-tuple: (S, K, Π, A, B).

1. S = {1, . . . , N} is the set of states. The state at time t is denoted s_t.

2. K = {k_1, . . . , k_M} is the output alphabet. In a discrete observation density case, M is the number of observation choices. In the above example, M equals the number of possible movements.

3. Initial state distribution Π = {π_i}, i ∈ S. π_i is defined as

π_i = P(s_1 = i)   (2.13)

4. State transition probability distribution A = {a_ij}, i, j ∈ S.

a_ij = P(s_{t+1} = j | s_t = i), 1 ≤ i, j ≤ N   (2.14)

5. Observation symbol probability distribution B = {b_j(o_t)}. The probabilistic function for each state j is:

b_j(o_t) = P(o_t | s_t = j)   (2.15)
After modeling a problem as an HMM, and assuming that some set of data was generated by the HMM, we are able to calculate the probabilities of the observation sequence and the probable underlying state sequences. We can also train the model parameters based on the observed data to get a more accurate model, and then use the trained model to predict unseen data.
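The five-tuple can be sketched as a small container class for the discrete-observation case; the dimensions and all probability values below are illustrative, not from the thesis.

```python
import numpy as np

class DiscreteHMM:
    """Container for the five-tuple (S, K, Pi, A, B) of section 2.5.

    S is implicit in N (states 0..N-1) and K in M (symbols 0..M-1).
    """
    def __init__(self, pi, A, B):
        self.pi = np.asarray(pi, dtype=float)  # Pi: initial distribution, shape (N,)
        self.A = np.asarray(A, dtype=float)    # A: transition matrix, shape (N, N)
        self.B = np.asarray(B, dtype=float)    # B: emission matrix, shape (N, M)
        assert np.isclose(self.pi.sum(), 1.0)          # initial probabilities sum to 1
        assert np.allclose(self.A.sum(axis=1), 1.0)    # each row of A sums to 1, cf. (2.4)
        assert np.allclose(self.B.sum(axis=1), 1.0)    # each state's emissions sum to 1

# A toy model: two states, three output symbols (all numbers made up).
hmm = DiscreteHMM(pi=[0.6, 0.4],
                  A=[[0.7, 0.3], [0.4, 0.6]],
                  B=[[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
```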
2.6 Discrete and continuous observation probability distribution
2.6.1 Discrete observation probability distribution
There are a fixed number of possible observations in the example in section 2.3. The function at each state is called a discrete probability density function. In financial time series analysis, we sometimes need to partition the probability density function of observations into a discrete set of symbols v_1, v_2, . . . , v_M. For example, we might want to model the direction of the price movement as a five-element vector {large drop, small drop, no change, small rise, large rise}. This partitioning process is called vector quantization. [Fur01] provides a survey of some of the most commonly used vector quantization algorithms.
After the vector quantization, a codebook is created and the following analysis and prediction are within the domain of the codebook. In the previous example, the observation symbol probability density function will be in the form:

b_j(o_t) = P(o_t = k | s_t = j), 1 ≤ k ≤ M   (2.16)
Using discrete probability distribution functions greatly simplifies modeling of the problem. However, they have limited representation power. To solve this problem, continuous probability distribution functions are used during the training phase, enabling us to get as accurate a model as possible. Then, in the prediction step, the codebook is used to estimate the probability of each possible symbol and produce the most probable one.
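As a minimal sketch of the quantization step, daily returns can be mapped to the five movement symbols with fixed cut-offs. The thresholds (0.5% and 2%) are arbitrary choices for illustration; a real codebook would come from one of the vector quantization algorithms surveyed in [Fur01].

```python
def quantize_return(r, small=0.005, large=0.02):
    """Map a daily return to one of five movement symbols.

    The 0.5% / 2% thresholds are hypothetical, not from the thesis.
    """
    if r <= -large:
        return "large drop"
    if r <= -small:
        return "small drop"
    if r < small:
        return "no change"
    if r < large:
        return "small rise"
    return "large rise"

# Quantize a few illustrative daily returns.
codes = [quantize_return(r) for r in (-0.03, -0.01, 0.001, 0.01, 0.05)]
```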
2.6.2 Continuous observation probability distribution
In a continuous observation probability HMM, the function b_j(o_t) is in the form of a continuous probability density function (pdf) or a mixture of continuous pdfs:

b_j(o_t) = Σ_{k=1}^M w_jk b_jk(o_t), j = 1, . . . , N   (2.17)

M is the number of mixture components and w_jk is the weight of each component. The mixture weights satisfy the following constraints:

Σ_{k=1}^M w_jk = 1, j = 1, . . . , N;   w_jk ≥ 0, j = 1, . . . , N, k = 1, . . . , M   (2.18)
Each b_jk(o_t) is a D-dimensional log-concave or elliptically symmetric density with mean vector µ_jk and covariance matrix Σ_jk:

b_jk(o_t) = N(o_t, µ_jk, Σ_jk)   (2.19)
Then the task for training is to learn the parameters of these pdfs. For details on mixtures of Gaussian distributions, please refer to [Mit97] and [MS99].
This D-dimensional multivariate time series training requires that we pick the most correlated time series as input. For example, in order to predict the price of IBM, it makes more sense to include the prices of Microsoft, Sun and other high-tech stocks rather than Exxon or McDonald's, whose behavior has less correlation with that of IBM. An automatic method for dimension reduction is Principal Component Analysis (PCA). The dimension of the multivariate time series cannot be too large, as it increases the computational complexity by a quadratic factor, so we need to find the factors most correlated with our prediction. We describe more about this subject in Chapter 3.
Another problem is the selection of the pdf for each mixture component. The most commonly used D-dimensional log-concave or elliptically symmetric density is the Gaussian density, or normal distribution.
The single Gaussian density function has the form:

b_jk(o_t) = N(o_t, µ_jk, Σ_jk) = (1 / (√(2π) σ)) exp(−(o_t − µ)² / (2σ²))   (2.20)

where o_t is a real value.
The multivariate Gaussian density is of the form:

b_jk(o_t) = N(o_t, µ_jk, Σ_jk) = (1 / ((2π)^{D/2} |Σ_jk|^{1/2})) exp(−(1/2)(o_t − µ_jk)^T Σ_jk^{−1} (o_t − µ_jk))   (2.21)

where o_t is a vector.
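Equations (2.17), (2.20) and (2.21) can be evaluated directly; the sketch below assumes numpy and uses made-up parameters. In one dimension the multivariate formula reduces to the scalar density.

```python
import numpy as np

def gaussian_pdf(o, mu, sigma):
    """Multivariate normal density N(o; mu, Sigma), equation (2.21)."""
    o, mu = np.atleast_1d(o).astype(float), np.atleast_1d(mu).astype(float)
    sigma = np.atleast_2d(sigma).astype(float)
    D = o.size
    diff = o - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm)

def mixture_pdf(o, weights, mus, sigmas):
    """b_j(o_t) = sum_k w_jk N(o_t; mu_jk, Sigma_jk), equation (2.17)."""
    return sum(w * gaussian_pdf(o, m, s) for w, m, s in zip(weights, mus, sigmas))

# 1-D sanity check: a standard normal at its mean equals 1/sqrt(2*pi).
p = gaussian_pdf(0.0, 0.0, 1.0)
# A two-component mixture with equal weights (all parameters made up).
m = mixture_pdf(0.0, [0.5, 0.5], [0.0, 1.0], [1.0, 1.0])
```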
2.7 Three Fundamental Problems for HMMs
There are three fundamental problems about the HMM that are of particular interest to us:

1. Given a model µ = (A, B, Π) and an observation sequence O = (o_1, . . . , o_T), how do we efficiently compute the probability of the observation sequence given the model, i.e., what is P(O | µ)?

2. Given a model µ and the observation sequence O, what is the underlying state sequence (1, . . . , N) that best "explains" the observations?

3. Given an observation sequence O, and a space of possible models, how do we adjust the parameters so as to find the model µ that maximizes P(O | µ)?
The first problem can be used to determine which of the trained models is most likely when the training observation sequence is given. In the financial time series case, given the past history of the price or index movements, which model best accounts for the movement history? The second problem is about finding the hidden path. In our previous example in section 2.3, what is the strategy sequence the financial professional applied on each trading day? The third problem is of the most significant interest because it deals with the training of the model. In the real-life financial time series analysis case, it is about training the model with past historical data. Only after training can we use the model to perform further predictions.
2.7.1 Problem 1: ﬁnding the probability of an observation
Given an observation sequence O = (o_1, . . . , o_T) and an HMM µ = (A, B, Π), we want to find the probability of the sequence, P(O | µ). This is also known as the evaluation problem. Since the observations are assumed independent of each other given the state sequence, the probability of a state sequence S = (s_1, . . . , s_T) generating the observation sequence can be calculated as:

P(O | S, µ) = ∏_{t=1}^T P(o_t | s_t, s_{t+1}, µ)   (2.22)
            = b_{s_1}(o_1) · b_{s_1 s_2}(o_2) · · · b_{s_{T−1} s_T}(o_T)   (2.23)

and the state transition probability,

P(S | µ) = π_{s_1} · a_{s_1 s_2} · a_{s_2 s_3} · · · a_{s_{T−1} s_T}.   (2.24)

The joint probability of O and S is:

P(O, S | µ) = P(O | S, µ) P(S | µ)   (2.25)

Therefore,

P(O | µ) = Σ_S P(O | S, µ) P(S | µ)   (2.26)
         = Σ_{s_1 ··· s_{T+1}} π_{s_1} ∏_{t=1}^T a_{s_t s_{t+1}} b_{s_t s_{t+1}}(o_t)   (2.27)
The computation is quite straightforward: sum the observation probabilities over each of the possible state sequences. A major problem with this approach is that enumerating every state sequence is inefficient. As the length T of the sequence grows, the computation grows exponentially: it requires (2T − 1) · N^{T+1} multiplications and N^T − 1 additions. To solve this problem, an efficient algorithm, the forward-backward algorithm, was developed; it cuts the computational complexity to linear in T.
The forward procedure
The forward variable α_i(t) is defined as:

α_i(t) = P(o_1 o_2 · · · o_{t−1}, s_t = i | µ)   (2.28)

α_i(t) stores the total probability of ending up in state i at time t, given the observation sequence o_1 · · · o_{t−1}. It is calculated by summing probabilities for all incoming arcs at a trellis node. The forward variable at each time t can be calculated inductively, see Figure 2.4.
Figure 2.4: The forward procedure
1. Initialization

α_i(1) = π_i, 1 ≤ i ≤ N   (2.29)

2. Induction

α_j(t + 1) = Σ_{i=1}^N α_i(t) a_ij b_ij(o_t), 1 ≤ t ≤ T, 1 ≤ j ≤ N   (2.30)

3. Update time

Set t = t + 1; return to step 2 if t < T; otherwise, terminate the algorithm.

4. Termination

P(O | µ) = Σ_{i=1}^N α_i(T)   (2.31)
The forward algorithm requires N(N + 1)(T − 1) + N multiplications and N(N − 1)(T − 1) additions. The complexity is O(TN²).
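The forward procedure can be sketched in numpy. Note one convention difference: the text's arc-emission form initializes α_i(1) = π_i, while the common state-emission variant below (the b_j(o_t) form the thesis adopts at the end of section 2.7.3) folds the first emission into the initialization. The O(TN²) cost shows up as one N×N matrix-vector product per time step. All model numbers are made up.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward procedure, state-emission form: returns alpha (T x N) and P(O | mu)."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                 # initialization (first emission folded in)
    for t in range(1, T):                        # induction: one matrix product per step
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                # termination: sum over final states

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
alpha, likelihood = forward(pi, A, B, [0, 1, 2])
```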
The backward procedure
The induction computation for the forward procedure can also be performed in the reverse order. The backward procedure calculates the probability of the partial observation sequence from t + 1 to the end, given the model µ and state i at time t. The backward variable β_i(t) is defined as:

β_i(t) = P(o_{t+1} o_{t+2} · · · o_T | s_t = i, µ)   (2.32)
The backward algorithm works from right to left through the same trellis:

1. Initialization

β_i(T) = 1, 1 ≤ i ≤ N   (2.33)

2. Induction

β_i(t) = Σ_{j=1}^N a_ij b_ij(o_t) β_j(t + 1), 1 ≤ t ≤ T, 1 ≤ i ≤ N   (2.34)

3. Update time

Set t = t − 1; return to step 2 if t > 0; otherwise, terminate the algorithm.

4. Total

P(O | µ) = Σ_{i=1}^N π_i β_i(1)   (2.35)
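The backward pass admits the same kind of sketch (again in the state-emission form, with illustrative numbers). Computing P(O | µ) from β must agree with the value produced by the forward procedure, which is a useful consistency check.

```python
import numpy as np

def backward(pi, A, B, obs):
    """Backward procedure, state-emission form: returns beta (T x N) and P(O | mu)."""
    N, T = len(pi), len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                               # initialization: beta_i(T) = 1
    for t in range(T - 2, -1, -1):               # induction, right to left
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    total = (pi * B[:, obs[0]] * beta[0]).sum()  # total probability P(O | mu)
    return beta, total

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
beta, likelihood = backward(pi, A, B, [0, 1, 2])
```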
Figure 2.5: The backward procedure
2.7.2 Problem 2: Finding the best state sequence
The second problem is to find the best state sequence given a model and the observation sequence. This is a problem often encountered in parsing speech and in pattern recognition. There are several possible ways to solve this problem. The difficulty is that there may be several possible optimality criteria. One of them is to choose the states that are individually most likely at each time t. For each time t, 1 ≤ t ≤ T + 1, we find the following probability variable:

γ_i(t) = P(s_t = i | O, µ)   (2.36)
       = P(s_t = i, O | µ) / P(O | µ)   (2.37)
       = α_i(t) β_i(t) / Σ_{j=1}^N α_j(t) β_j(t)   (2.38)

The individually most likely state sequence S* can be found as:

S* = argmax_{1≤i≤N} γ_i(t), 1 ≤ t ≤ T + 1   (2.39)
This quantity maximizes the expected number of correct states. However, this approach may generate an unlikely state sequence, because it does not take the state transition probabilities into consideration. For example, if at some point we have a zero transition probability a_ij = 0, the optimal state sequence found may be invalid.
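Equation (2.38) can be checked numerically: γ is the elementwise product of the forward and backward variables, normalized over the states at each time. The two-state model below is made up, and the forward pass uses the state-emission convention (first emission folded into α_i(1)).

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 1, 2]
T, N = len(obs), len(pi)

# Forward and backward variables for the toy model.
alpha = np.zeros((T, N)); beta = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
beta[-1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)   # equation (2.38): normalize over states
best_states = gamma.argmax(axis=1)          # individually most likely states, eq. (2.39)
```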
Therefore, a more eﬃcient algorithm, the Viterbi algorithm, based on dynamic programming,
is used to ﬁnd the best state sequence.
The Viterbi algorithm
The Viterbi algorithm is designed to find the most likely state sequence, S* = (s_1, s_2, . . . , s_T), given the observation sequence O = (o_1, o_2, . . . , o_T):

argmax_{S*} P(S* | O, µ)

It is sufficient to maximize, for a fixed observation sequence O:

argmax_{S*} P(S*, O | µ)
We define the following variable:

δ_j(t) = max_{s_1 ··· s_{t−1}} P(s_1 · · · s_{t−1}, o_1 · · · o_{t−1}, s_t = j | µ)   (2.40)
This variable stores the probability of observing o_1 o_2 · · · o_{t−1} along the most likely path that ends in state j at time t, given the model µ. The corresponding variable ψ_j(t) stores the node of the incoming arc that leads to this most probable path. The calculation is done by induction, similar to the forward-backward algorithm, except that the forward-backward algorithm sums over previous states while the Viterbi algorithm maximizes.
The complete Viterbi algorithm is as follows:

1. Initialization

δ_i(1) = π_i b_i(o_1), 1 ≤ i ≤ N   (2.41)
ψ_i(1) = 0, 1 ≤ i ≤ N   (2.42)

2. Induction

δ_j(t) = b_ij(o_t) max_{1≤i≤N} [δ_i(t − 1) a_ij]   (2.43)
ψ_j(t) = argmax_{1≤i≤N} [δ_i(t − 1) a_ij]   (2.44)

3. Update time

t = t + 1   (2.45)

Return to step 2 if t ≤ T; otherwise, terminate the algorithm.

4. Termination

P* = max_{1≤i≤N} [δ_i(T)]   (2.46)
s*_T = argmax_{1≤i≤N} [δ_i(T)]   (2.47)

5. Path readout

s*_t = ψ_{t+1}(s*_{t+1})   (2.48)
When there are multiple paths generated by the Viterbi algorithm, we can pick one randomly or choose the best n paths. Figure 2.6 shows how the Viterbi algorithm finds a single best path.
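The algorithm above can be sketched as follows, in the state-emission form (δ_i(1) = π_i b_i(o_1), induction with b_j(o_t)); the toy parameters are made up. Replacing the sums of the forward procedure with maximizations is the only structural change.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi algorithm: best state path and its probability P*."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))              # delta_j(t): best-path probability
    psi = np.zeros((T, N), dtype=int)     # psi_j(t): backpointer to best predecessor
    delta[0] = pi * B[:, obs[0]]          # initialization
    for t in range(1, T):                 # induction: max instead of sum
        scores = delta[t - 1][:, None] * A    # scores[i, j] = delta_i(t-1) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]      # termination, eqs. (2.46)-(2.47)
    for t in range(T - 1, 0, -1):         # path readout, eq. (2.48)
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta[-1].max())

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
path, p_star = viterbi(pi, A, B, [0, 1, 2])
```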
Figure 2.6: Finding best path with the Viterbi algorithm
2.7.3 Problem 3: Parameter estimation
The last and most difficult problem about HMMs is parameter estimation. Given an observation sequence, we want to find the model parameters µ = (A, B, π) that best explain the observation sequence. The problem can be reformulated as finding the parameters that maximize the following probability:

argmax_µ P(O | µ)

There is no known analytic method to choose µ to maximize P(O | µ), but we can use a local maximization algorithm to find the highest probability. This algorithm is called the Baum-Welch algorithm. It is a special case of the Expectation-Maximization method and works iteratively to improve the likelihood P(O | µ). This iterative process is called the training of the model. The Baum-Welch algorithm is numerically stable, with the likelihood non-decreasing at each iteration, and it converges linearly to a local optimum.
To work out the optimal model µ = (A, B, π) iteratively, we will need to define a few intermediate variables. Define p_t(i, j), 1 ≤ t ≤ T, 1 ≤ i, j ≤ N as follows:
p_t(i, j) = P(s_t = i, s_{t+1} = j | O, µ)   (2.49)
          = P(s_t = i, s_{t+1} = j, O | µ) / P(O | µ)   (2.50)
          = α_i(t) a_ij b_ij(o_t) β_j(t + 1) / Σ_{m=1}^N α_m(t) β_m(t)   (2.51)
          = α_i(t) a_ij b_ij(o_t) β_j(t + 1) / (Σ_{m=1}^N Σ_{n=1}^N α_m(t) a_mn b_mn(o_t) β_n(t + 1))   (2.52)
This is the probability of being at state i at time t, and at state j at time t + 1, given
the model µ and the observation O.
Then define γ_i(t). This is the probability of being at state i at time t, given the observation O and the model µ:

γ_i(t) = P(s_t = i | O, µ)   (2.53)
       = Σ_{j=1}^N P(s_t = i, s_{t+1} = j | O, µ)   (2.54)
       = Σ_{j=1}^N p_t(i, j)   (2.55)
The above equation holds because γ_i(t) is the expected number of transitions from state i, and p_t(i, j) is the expected number of transitions from state i to j.
Given the above deﬁnitions we begin with an initial model µ and run the training data
O through the current model to estimate the expectations of each model parameter. Then
we can change the model to maximize the values of the paths that are used. By repeating
this process we hope to converge on the optimal values for the model parameters.
The re-estimation formulas of the model are:

π_i = probability of being at state i at time t = 1
    = γ_i(1)   (2.56)

a_ij = (expected number of transitions from state i to j) / (expected number of transitions from state i)
     = Σ_{t=1}^T p_t(i, j) / Σ_{t=1}^T γ_i(t)   (2.57)

b_ijk = (expected number of transitions from i to j with k observed) / (expected number of transitions from i to j)
      = Σ_{t: o_t = k, 1≤t≤T} p_t(i, j) / Σ_{t=1}^T p_t(i, j)   (2.58)
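One full re-estimation step can be sketched for the discrete, state-emission case (b_j(o_t) in place of b_ij, the convention adopted below); the toy model is made up. The EM guarantee is visible numerically: re-running the step with the updated parameters never decreases the likelihood.

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch re-estimation step, equations (2.56)-(2.58) in b_j form."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()

    gamma = alpha * beta / likelihood              # gamma_i(t), eq. (2.53)
    xi = np.zeros((T - 1, N, N))                   # p_t(i, j), eq. (2.52)
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A *
                 (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / likelihood

    new_pi = gamma[0]                                          # eq. (2.56)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # eq. (2.57)
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):                                # eq. (2.58) analogue
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, likelihood

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 0, 1]
new_pi, new_A, new_B, L0 = baum_welch_step(pi, A, B, obs)
```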
In the following calculations, we will replace b_ij(o_t) with b_j(o_t), i.e., each observation is generated by the probability density function associated with one state at one time step. This is also called a first-order HMM.
2.8 Chapter Summary
In this chapter, the general form of the HMM is introduced, both theoretically and with examples. The conditions that HMMs must observe, i.e., the Markov properties, are also described. Then we study the three basic problems involved with any HMM. The model design and experiments that follow are all extensions based on the basic model and the techniques for solving the three problems. In fact, in the prediction task, the Viterbi algorithm and the Baum-Welch algorithm are used in combination to make the system work.
Chapter 3
HMM for Financial Time Series Analysis
In this chapter we construct an HMM based on the discussions in chapter 2. We ﬁrst give
a brief survey of Weigend and Shi’s previous work on the application of Hidden Markov
Experts to ﬁnancial time series prediction [SW97], analyze problems with their approach
and then propose a new model.
We use an HMM with emission density functions modeled as mixtures of Gaussians, in combination with the usual transition probabilities. We introduce the Gaussian mixtures that we use as the continuous observation density function associated with each state. The re-estimation formulas are then derived for the Gaussian mixtures. We apply an add-drop approach to the training data.
3.1 Previous Work
In [SW97], Weigend and Shi propose a Hidden Markov Expert Model to forecast the change of the S&P 500 Index at a half-hour time step. The "experts" at each state of the Hidden Markov Model applied to financial time series prediction are three neural networks, with characteristics and predicting biases distinct from each other. One of the experts is good at forecasting low-volatility data, the second expert is best for high-volatility regions, and the third expert is responsible for the "outliers".[1]
The training of the Hidden Markov Experts mainly involves the re-estimation of the transition probability matrix A, just as we did in equations (2.58) and (2.59). But there are no re-estimations for each of the three "experts". This means that the experts are fixed, and we must trust them during the whole prediction process because there is no way of improving their prediction accuracy. This is problematic because even if we can divide a long time series into several segments with different volatilities, segments with similar volatilities may not necessarily reflect similar patterns. In the later sections of this chapter we introduce the Gaussian mixtures, where all parameters of each mixture component, the counterpart of the "expert", are adjustable.
Then in the prediction phase, a training pool is created. This pool is updated by adding the latest available data and eliminating the earliest data in the pool. The data in each pool are then trained together to generate the prediction for the next coming day. In this approach, data at each time step are of equal importance. This can cause big problems during highly volatile periods or at the turning points of long-term trends: too much weight is given to the early samples. We will solve this problem with an adjustable weighted EM algorithm in chapter 4.
Another problem with Shi and Weigend's model is the lack of additional information to
help prediction decisions. Their model takes the target time series as the only input. By
providing more of the available information, such as similar time series or indicator series
derived from the original series, we can make the model more reliable. In section 3.3 we
introduce multivariate Gaussian mixtures that take a vector composed of the Dow Jones
Composite Index and the S&P 500 Index as input. Without having to worry about the
exact mathematical relationship between these two series, we let the HMM combine them
to make predictions for the S&P 500 Index. Then in chapter 4, the Dow Jones Index is
used as an error controller: when the system's prediction for the S&P 500 does not coincide
with the Dow, we place less confidence in the HMM. This is because the Dow is a less
volatile index than the S&P 500. By comparing the movement prediction of the S&P 500
with the Dow, we can balance between the long-term trend and short-term²
¹ Outliers are a few observations that are not well fitted by the "best" available model. Refer to Appendix I for more explanation.
² To short means to sell something that you don't have. To do this, you have to borrow the asset that you want to short.
whipsaws.³
3.2 Baum-Welch Re-estimation for Gaussian Mixtures
The re-estimation formulas in the previous chapter focused on the case where the symbol
emission probabilities are discrete, in other words, where the emitted symbols are discrete
symbols from a finite alphabet. Although, as mentioned before, we can use the vector
quantization technique to map each continuous value to a discrete value via a codebook,
there may be serious degradation after quantization. It is therefore advantageous to use
HMMs with continuous observation densities.
The most generally used form of continuous probability density function (pdf) is the
Gaussian mixture:
$$b_i(o_t) = \sum_{m=1}^{M} w_{im}\, N(o_t, \mu_{im}, \Sigma_{im}) \tag{3.1}$$
$o_t$ is the observed vector and $w_{im}$ is the weight of the $m$th Gaussian mixture component at state $i$. The mixture weights $w$ must satisfy the following constraints:
$$\sum_{m=1}^{M} w_{im} = 1, \quad 1 \le i \le N \tag{3.2}$$
$$w_{im} \ge 0, \quad 1 \le i \le N,\; 1 \le m \le M$$
$N(o_t, \mu_{im}, \Sigma_{im})$ is a multivariate Gaussian distribution with mean vector $\mu_{im}$ and covariance matrix $\Sigma_{im}$:
$$N(o_t, \mu_{im}, \Sigma_{im}) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_{im}|}} \exp\!\left(-\frac{1}{2}(o_t - \mu_{im})^T \Sigma_{im}^{-1} (o_t - \mu_{im})\right) \tag{3.3}$$
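The mixture emission density (3.1) with component densities (3.3) can be sketched directly in code. This is an illustrative implementation, not code from the thesis; the function names are ours:

```python
import numpy as np

def gaussian_pdf(o, mu, sigma):
    """Multivariate Gaussian density N(o; mu, Sigma), as in equation (3.3)."""
    d = len(mu)
    diff = o - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

def emission_density(o, weights, means, covs):
    """Emission density b_i(o_t) = sum_m w_im N(o_t; mu_im, Sigma_im), eq. (3.1)."""
    return sum(w * gaussian_pdf(o, mu, cov)
               for w, mu, cov in zip(weights, means, covs))
```

For a single state $i$, `weights` would hold the $w_{im}$ and `means`/`covs` the per-component parameters.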
The learning of the parameters of the HMM thus becomes the learning of the means and
covariances of each Gaussian mixture component. The intermediate variable $\delta$ is redefined as:
$$\delta_i(t) = \sum_{m=1}^{M} \delta_{im}(t) = \sum_{m=1}^{M} \frac{1}{P} \sum_{j=1}^{N} \alpha_j(t-1)\, a_{ji}\, w_{im}\, b_{im}(o_t)\, \beta_i(t) \tag{3.4}$$
As previously defined, $M$ is the total number of Gaussian mixture components at state
$i$ and $N$ is the total number of states. The difference from the previous case is that
³ A whipsaw occurs when a buy or sell signal is reversed in a short time. Volatile markets and sensitive indicators can cause whipsaws.
Figure 3.1: Three single Gaussian density functions (G1: 20%, G2: 30%, G3: 50%). Each of them contributes a percentage to the combined output in the mixture.
Figure 3.2: A Gaussian mixture derived from the three Gaussian densities above. For each small area $\delta$ close to each point $x$ on the x-axis, there is a 20% probability that the random variable $x$ is governed by the first Gaussian density function, a 30% probability that it is governed by the second Gaussian, and a 50% probability that it is governed by the third.
rather than associating each state with a single probability density function, each observation
is generated by a set of Gaussian mixture components, with each component contributing to
the overall result with some weight.
We then also need a variable that denotes the probability of being at state $j$ at time $t$
with the $k$th mixture component accounting for the current observation $o_t$:
$$\gamma_{j,k}(t) = \frac{\alpha_j(t)\,\beta_j(t)}{\sum_{j=1}^{N} \alpha_j(t)\,\beta_j(t)} \cdot \frac{w_{jk}\, N(o_t, \mu_{jk}, \Sigma_{jk})}{\sum_{k=1}^{M} w_{jk}\, N(o_t, \mu_{jk}, \Sigma_{jk})} \tag{3.5}$$
With the new intermediate variable $\delta$ defined, we can derive the re-estimation formulas below:
$$\mu_{im} = \frac{\sum_{t=1}^{T} \delta_{im}(t)\, o_t}{\sum_{t=1}^{T} \delta_{im}(t)} \tag{3.6}$$
$$\Sigma_{im} = \frac{\sum_{t=1}^{T} \delta_{im}(t)\,(o_t - \mu_{im})(o_t - \mu_{im})^T}{\sum_{t=1}^{T} \delta_{im}(t)} \tag{3.7}$$
$$w_{im} = \frac{\sum_{t=1}^{T} \delta_{im}(t)}{\sum_{t=1}^{T} \delta_{i}(t)} \tag{3.8}$$
$$a_{ij} = \frac{\sum_{t=1}^{T-1} \alpha_{i}(t)\, a_{ij}\, b_{j}(o_{t+1})\, \beta_{j}(t+1)}{\sum_{t=1}^{T} \alpha_{i}(t)\, \beta_{i}(t)} \tag{3.9}$$
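The re-estimation formulas (3.6)-(3.8) are weighted averages once the responsibilities $\delta_{im}(t)$ are available. A minimal sketch for scalar observations (array shapes and names are our own, not from the thesis):

```python
import numpy as np

def reestimate(delta, obs):
    """Re-estimation (3.6)-(3.8) given responsibilities delta[t, i, m]
    and scalar observations obs[t]; returns new means, variances, weights."""
    # total responsibility of each (state, mixture) pair over time
    denom = delta.sum(axis=0)                                    # shape (N, M)
    mu = (delta * obs[:, None, None]).sum(axis=0) / denom        # (3.6)
    diff2 = (obs[:, None, None] - mu[None]) ** 2
    sigma = (delta * diff2).sum(axis=0) / denom                  # (3.7)
    # delta_i(t) = sum_m delta_im(t); each state's weights sum to 1
    w = denom / delta.sum(axis=(0, 2))[:, None]                  # (3.8)
    return mu, sigma, w
```

The weights returned by (3.8) automatically form a distribution per state, since the denominator is the sum of the numerators over the mixture components.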
3.3 Multivariate Gaussian Mixture as Observation Density Function
Sometimes the observation vectors have more than one element. In other words, the
observed data are generated by several distinct underlying causes that have some unknown
relationship between them. Each cause contributes independently to the generation process,
but we can only see the observation sequence as a final mixture, without knowing how much,
and in what way, each cause contributes to the resulting observation sequence.
This phenomenon is of great interest in financial time series analysis. Many time series
are correlated with other time series. For example, in the stock market, volume by itself
is of little interest, but it contains plenty of information about the trend of the market
when combined with the closing price. Canada's stock market is affected by the US economic
cycle because Canada is the largest trading partner of the US, so the factors that affect the
US stock market also have an impact on the Canadian market. Researchers have found that
the best Canadian year is the third year of the US presidential election cycle.
Some day traders use a technique called "paired trading": they trade two stocks with
similar characteristics at the same time, aiming to profit from arbitrage opportunities or to
hedge possible risks. If they buy the stock of Wal-Mart, they will short the stock of Kmart.
These examples show that there is a need to include relevant information beyond the time
series that we are studying.
3.3.1 Multivariate Gaussian distributions
The multivariate $m$-dimensional Gaussian has the same parameters as the single normal
distribution, except that the mean is a vector $\mu_j$ and the covariance is an $m \times m$
covariance matrix $\Sigma_j$. The probability density function for a Gaussian is given by:
$$n_j(x, \mu_j, \Sigma_j) = \frac{1}{\sqrt{(2\pi)^m |\Sigma_j|}} \exp\!\left(-\frac{1}{2}(x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j)\right) \tag{3.10}$$
At each state, we assume the observation is generated by $k$ Gaussians, so the overall output is:
$$\sum_{j=1}^{k} w_j\, N(x, \mu_j, \Sigma_j) \tag{3.11}$$
The initial values of the weights $w_j$ of the mixture components should sum to 1.
The training process then becomes a process of finding the parameters that maximize the
probability of this output observation.
3.3.2 EM for multivariate Gaussian mixtures
We now develop the EM algorithm for re-estimating the parameters of the multivariate
Gaussian mixture. Let $\theta_j = (\mu_j, \Sigma_j, \pi_j)$ be the $j$th mixture component; the parameters
of the model are then $\Theta = (\theta_1, \ldots, \theta_k)^T$. Also define a hidden variable $E[z_{ij}]$ as the
probability of generating the $i$th observation with the $j$th mixture component. The EM
algorithm that finds the parameters maximizing the probability of the observation sequence
$O$ consists of the following two steps (we use a mixture of two components $\theta_1, \theta_2$ to
explain the algorithm):
Estimation step:
Calculate the expected value $E[z_{ij}]$ of each hidden variable $z_{ij}$, given the current values of $\Theta = (\theta_1, \theta_2)$.
Maximization step:
Calculate a new maximum-likelihood value of $\Theta' = (\theta_1, \theta_2)$, assuming that the value taken on by each hidden variable $z_{ij}$ is the expected value found in the first step.
In the HMM context, the probability of the hidden variable $z_{ij}$ is the intermediate
variable $\gamma_{ij}(t)$ discussed in section 3.2. Combining the EM algorithm for multivariate
Gaussian mixtures with the EM algorithm for HMMs, we obtain the following formulas
for the parameters of the HMM with multivariate Gaussian mixtures as the observation
density function:
$$\mu_{im} = \frac{\sum_{t=1}^{T} \delta_{im}(t)\, o_t}{\sum_{t=1}^{T} \delta_{im}(t)} \tag{3.12}$$
$$\Sigma_{im} = \frac{\sum_{t=1}^{T} \delta_{im}(t)\,(o_t - \mu_{im})(o_t - \mu_{im})^T}{\sum_{t=1}^{T} \delta_{im}(t)} \tag{3.13}$$
$$w_{im} = \frac{\sum_{t=1}^{T} \delta_{im}(t)}{\sum_{t=1}^{T} \delta_{i}(t)} \tag{3.14}$$
$$a_{ij} = \frac{\sum_{t=1}^{T-1} \alpha_{i}(t)\, a_{ij}\, b_{j}(o_{t+1})\, \beta_{j}(t+1)}{\sum_{t=1}^{T} \alpha_{i}(t)\, \beta_{i}(t)} \tag{3.15}$$
3.4 Multiple Observation Sequences
In the previous discussions, the EM algorithm was applied to a single observation sequence $O$.
However, real-life financial time series can span a very long range. Modeling them as one
single observation sequence may give a low recognition rate when testing data from some
specific period of time, because the pattern or trend at some point in time may be quite
different from the overall pattern. As stated by Graham [Gra86]: "There is no generally
known method of chart reading that has been continuously successful for a long period of
time. If it were known, it would be speedily adopted by numberless traders. This very
following would bring its usefulness to an end."
Rabiner offers a way to solve a similar kind of problem in speech recognition by splitting
the original observation sequence $O$ into a set of $K$ short sequences [Rab89]:
$$O = [O^{(1)}, O^{(2)}, \ldots, O^{(K)}] \tag{3.16}$$
where $O^{(k)} = [O^{(k)}_1, O^{(k)}_2, \ldots, O^{(k)}_{T_k}]$ is the $k$th observation sequence and $T_k$ is its length.
Assuming that the observation sequences are independent of each other, the probability
function $P(O|\mu)$ to maximize now becomes:
$$P(O|\mu) = \prod_{k=1}^{K} P(O^{(k)}|\mu) \tag{3.17}$$
The variables $\alpha_i(t)$, $\beta_i(t)$, $p_t(i,j)$ and $\delta_i(t)$ are calculated for each training sequence $O^{(k)}$.
The re-estimation formulas are then modified by adding together all the probabilities for
each sequence. In this way the new parameters can be viewed as averages of the optima
given by the individual training sequences. The re-estimation formulas for multiple
observation sequences are:
$$\mu_{im} = \frac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \delta^{(k)}_{im}(t)\, o^{(k)}_t}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \delta^{(k)}_{im}(t)} \tag{3.18}$$
$$\Sigma_{im} = \frac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \delta^{(k)}_{im}(t)\,(o^{(k)}_t - \mu_{im})(o^{(k)}_t - \mu_{im})^T}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \delta^{(k)}_{im}(t)} \tag{3.19}$$
$$w_{im} = \frac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \delta^{(k)}_{im}(t)}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \sum_{m=1}^{M} \delta^{(k)}_{im}(t)} \tag{3.20}$$
$$a_{ij} = \frac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k-1} \alpha^{(k)}_{i}(t)\, a_{ij}\, b_{j}(o^{(k)}_{t+1})\, \beta^{(k)}_{j}(t+1)}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k-1} \alpha^{(k)}_{i}(t)\, \beta^{(k)}_{i}(t)} \tag{3.21}$$
Rabiner's algorithm is used to process a batch of speech sequences that are of equal
importance, so he includes all the training sequences in the training process. In financial
time series, it is often the case that patterns change over time. In order to capture the
most recent trend, we create a dynamic training pool: the oldest data in the pool are popped
out and new data are pushed in when a trading day ends. Each training run starts from the
parameters of the previous day, but with the updated data.
Figure 3.3: The dynamic training pool
3.5 Dynamic Training Pool
We introduce a new concept in the training process called the training pool. This is a
segment of data that is updated dynamically throughout the training. At the end of a
trading day, when the market closes, we calculate the return for that day and add it to
the pool; in the meantime, we drop the oldest data from the pool. The training is then
performed again, based on the parameters from the last training run and the data from the
new pool. This way the system always trains on fresh data. In the training pool, each data
sample serves as both a prediction target and a training sample. See figure 3.3 for a
graphical illustration of the dynamic training pool.
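The pool is a fixed-length queue: pushing a new return silently drops the oldest one. A minimal sketch (the class name and the window size of 20 are our own choices, not fixed by the thesis):

```python
from collections import deque

class TrainingPool:
    """Sliding window of daily returns for the dynamic training pool of
    section 3.5 (size 20 corresponds to roughly one trading month)."""
    def __init__(self, size=20):
        self.size = size
        self.data = deque(maxlen=size)

    def update(self, daily_return):
        # deque with maxlen drops the oldest element automatically
        self.data.append(daily_return)

    def ready(self):
        return len(self.data) == self.size
```

After each market close, `update` is called once and the model is retrained on `data`, starting from the previous day's parameters.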
3.6 Making Predictions with HMM
3.6.1 Setup of HMM Parameters
We construct an HMM with Gaussian mixtures as the observation density function (call it
HMGM). The HMM has 5 states. Each state is associated with a Gaussian mixture of 4
Gaussian probability distributions. The HMM is a fully connected trellis; each transition
probability is assigned a random real value between 0 and 0.2. The initial state probabilities
$\Pi = \{\pi_i\}$ are assigned a set of random values between 0 and 0.2, and the initial mixture
weights $w$ are random values between 0 and 0.25. The initial covariance of each Gaussian
probability density function is set to 1, and the mean of each Gaussian mixture component
is assigned a real value between −0.3 and 0.3, the normal range of possible daily returns.
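A sketch of this initialization (illustrative only; we additionally normalize the transition rows, the initial state probabilities and the mixture weights so that each sums to 1, which the random ranges above imply but the text does not state explicitly):

```python
import numpy as np

def init_hmgm(n_states=5, n_mix=4, seed=0):
    """Random HMGM initialization following the ranges described above."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(0.0, 0.2, (n_states, n_states))
    A /= A.sum(axis=1, keepdims=True)          # each row is a distribution
    pi = rng.uniform(0.0, 0.2, n_states)
    pi /= pi.sum()
    w = rng.uniform(0.0, 0.25, (n_states, n_mix))
    w /= w.sum(axis=1, keepdims=True)          # mixture weights sum to 1 per state
    mu = rng.uniform(-0.3, 0.3, (n_states, n_mix))   # plausible daily returns
    var = np.ones((n_states, n_mix))                 # initial covariance of 1
    return A, pi, w, mu, var
```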
Intuitively, the number of states is the number of strategies we have for making predictions.
This number should be neither too large nor too small. Too many strategies will show little
distinction from each other and will increase the computational cost. On the other hand, if
the number of states is too small, the risk of misclassification increases. This can also be
explained intuitively using financial theory: the more diversified the strategies, the less the
investment risk and the less the return. The same considerations apply to the number of
mixture components. We have run experiments with different numbers of states and mixture
components; based on our experience, 5 states and 4 mixture components are reasonable
numbers for the model. The architecture of the HMGM is shown in Fig 3.4. This setting is
similar to Shi and Weigend's experimental environment, where they have one neural net
expert associated with each of the three states; see [WMS95] and [SW97]. The optimal
number of states can be decided through experiments.
The input to the HMGM is the real-number vector introduced in section 1.3. In the
following section we describe how the real number is transformed into long or short
signals.
3.6.2 Making predictions
We now come to the core part of the problem: making predictions before each trading day
starts. In fact, the prediction process is carried out together with the training at the same
Figure 3.4: Hidden Markov Gaussian Mixtures. The states s1-s5 unfold along the time axis, and each state is associated with the mixture components G1-G4.
time. In other words, at the end of each trading day, when the market return of that day
is known to the public, we include it in the training system to absorb the most up-to-date
information. This is exactly what human analysts and professional traders do: many of
them work for hours after the market closes at 4pm Atlantic time. The HMGM is then
trained again with the newest time series and the parameters are updated.
When the training is done and the parameters of the HMGM are updated, we perform
the Viterbi algorithm to find the most "reliable" Gaussian mixture, the one associated with
the most probable state. The weighted average of the means of that Gaussian mixture is
taken as the prediction for tomorrow. The induction step in section 2.7.2 is revised by
replacing the function $b_j(o_t)$ with the Gaussian density $N(o_t, \mu_{jm}, \Sigma_{jm})$ for each state $j$
and mixture component $m$:
$$\delta_j(t) = \max_{1 \le i \le N}\left[\delta_i(t-1)\, a_{ij}\right] \sum_{m=1}^{M} w_{jm}\, N(o_t, \mu_{jm}, \Sigma_{jm}) \tag{3.22}$$
$$\psi_j(t) = \arg\max_{1 \le i \le N}\left[\delta_i(t-1)\, a_{ij}\right] \sum_{m=1}^{M} w_{jm}\, N(o_t, \mu_{jm}, \Sigma_{jm}) \tag{3.23}$$
Next we transform the real values generated by the Gaussian mixtures into binary
symbols: positive or negative signals of the market. If the weighted mean of the Gaussian
mixture at the selected state is positive or zero, we output a positive signal; otherwise we
output a negative signal. In other words, we have a threshold at zero. In our experiment,
a weighted mean of zero generates a positive signal, which in turn results in a long
strategy.
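The zero-threshold rule can be stated in a few lines of code (the function name is ours):

```python
def trading_signal(weighted_mean):
    """Zero-threshold rule: a non-negative predicted return maps to a long
    signal (+1), a negative one to a short signal (-1). A weighted mean of
    exactly zero deliberately maps to long, as described in the text."""
    return 1 if weighted_mean >= 0 else -1
```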
Algorithm 3.1 is the revised multi-observation EM algorithm for the HMGM, written in
pseudocode.
The training set can be of any size; however, as most financial analysts believe, a five-day
or twenty-day set is desirable, because five days are the trading days in a week and one
month consists of roughly twenty trading days. There is a balance between the size of the
training set and the sensitivity of the model: the bigger the training set, the more accurate
the model in terms of finding the long-term trend, but the less sensitive the trained model
is to whipsaws.
Algorithm 3.1: Combined training-prediction algorithm for the HMGM

while not end of series do
    Initialization:
        if tradingDay = first day of series then
            assign random values to the HMM parameters
        else
            initialize parameters with the last updated values
        end if
    Training data set update:
        1. Drop day 1 from the previous training set
        2. Add the current day (the day that has just ended) to the training set
        3. Update all parameters using the EM algorithm
    Prediction:
        1. Run the Viterbi algorithm to find the Gaussian mixture at the most probable state
        2. Compare the weighted mean of the Gaussian mixture against the threshold zero
        3. Output the predicted signal
end while
3.7 Chapter Summary
In this chapter, we construct an HMM to predict financial time series, more specifically
the S&P 500 Index. Issues considered in the modeling process include the selection of the
emission density function, the EM update rule, and the characteristics of the selected
financial time series; these issues are addressed throughout the explanations in this chapter.
Each state of the HMM is associated with a mixture of Gaussian functions with different
parameters, which gives each state a different prediction preference: some states are more
sensitive to low-volatility data and others are more accurate on high-volatility segments.
The revised weighted EM update algorithm puts more emphasis on recent data. The Dow
Jones Index is included in the experiments as a supporting indicator for deciding the
volatility of the S&P 500 series. We have achieved impressive results, which are shown in
chapter 5.
Chapter 4
Exponentially Weighted EM Algorithm for HMM
In this chapter we first review the traditional EM algorithm in more depth by taking a close
look at the composition of the likelihood function and the Q function. Originally formalized
in [DLR77], the EM re-estimation algorithm has been intensively studied under many
problem settings, including the Hidden Markov Model [BP66] [RJ86] [Rab89], conditional/joint
probability estimation [BF96] [JP98], constrained-topology HMMs [Row00], coupled HMMs
[ZG02], hierarchical HMMs [FST98], and Gaussian mixtures [RW84] [XJ96] [RY98]. However,
none of these variants of the EM algorithm takes importance attenuation in the time
dimension into consideration. They take it for granted that each sample in the training
data set is of equal importance, and the final estimation is based on this equal-importance
assumption, no matter how big the training set is. This approach works well when the
training set is small and the data series studied are not time-sensitive, or when the time
dimension is not a major concern. Traditional EM-based HMM models have found great
success in language and video processing, where the order of the observed data is not
significantly important.
However, financial time series differ from all the above signals. They have some unique
characteristics: we need to focus more on recent data while keeping past patterns in mind,
but at a reduced confidence. As we discussed in chapter 1, many patterns disappear or
diminish after they have been known for a while. A persuasive example is the change of the
Dow Jones Index: it took the Dow Jones Index 57 years to rise from 200 to 2000, yet only
12 years to rise from 2000 to 10,000.
Recognizing the change of importance over the time dimension, and motivated by the
concept of the Exponential Moving Average (EMA) of financial time series, we present the
derivation of the exponentially weighted EM algorithm. We show that the final re-estimates
of the parameters of the HMM and the Gaussian densities can be expressed as combinations
of exponential moving averages (EMAs) of the intermediate state variables. To find a better
balance between short-term and long-term trends, we further develop a double weighted
EM algorithm whose sensitivity is adjustable by including volatility information. We show
that the convergence of the exponentially weighted EM algorithm is guaranteed by a
weighted version of the EM Theorem.
4.1 Maximum Likelihood
We have a density function $p(X|\Theta)$ that represents the probability of the observation
sequence $X = \{x_1, \ldots, x_N\}$ given the parameters $\Theta$. In the HMGM model, $\Theta$ includes
the set of means and covariances of the Gaussian mixtures associated with each state, as
well as the transition probability matrix and the probability of each mixture component.
Assuming the data are independent and identically distributed (i.i.d.) with distribution $p$,
the probability of observing the whole data set is:
$$p(X|\Theta) = \prod_{i=1}^{N} p(x_i|\Theta) = L(\Theta|X) \tag{4.1}$$
The function $L$ is called the likelihood function of the parameters given the data. The
goal of the EM algorithm is to find the $\Theta^*$ that maximizes $L$:
$$\Theta^* = \arg\max_{\Theta} L(\Theta|X) \tag{4.2}$$
Since the log function is monotonic, we often maximize the logarithm of the likelihood
function instead, because it reduces multiplications to additions and thus makes the
derivation analytically easier.
4.2 The Traditional EM Algorithm
The EM algorithm views the target data as composed of two parts: the observed part and
the underlying part. The observed data are governed by some distribution with given form
but unknown parameters. The underlying data can be viewed as a random variable itself.
The task of the EM algorithm is to find the parameters of the distribution governing the
observed part and of the distribution generating the underlying part.
Now take $X$ as the observed part, governed by some distribution with known form but
unknown parameters; $X$ is called the incomplete data. Then assume the complete data
$Z = (X, Y)$. By Bayes' rule, the joint density function of one sample of the complete data is:
$$p(z|\Theta) = p(x, y|\Theta) = p(y|x, \Theta)\, p(x|\Theta) \tag{4.3}$$
$x$ and $y$ are individual samples in the data sets $X$ and $Y$. Note that the prior probability
$p(x|\Theta)$ and the distribution of the hidden data are initially generated by assigning random
values. We now define the complete-data likelihood function for the overall data set, based
on the above joint density function of individual samples:
$$L(\Theta|Z) = L(\Theta|X, Y) = p(X, Y|\Theta) \tag{4.4}$$
When estimating the hidden data $Y$, $X$ and $\Theta$ can be treated as constants, and the
complete-data likelihood becomes a function of $Y$: $L(\Theta|X, Y) = h_{X,\Theta}(Y)$, where $Y$ is
the only random variable.
Next, the EM algorithm finds the expected value of the complete-data log-likelihood
$\log p(X, Y|\Theta)$ with respect to the unknown data $Y$, given the observed part $X$ and the
parameters from the last estimation. We define a Q function to denote this expectation:
$$Q(\Theta, \Theta^{(i-1)}) = E\!\left[\log p(X, Y|\Theta) \,\middle|\, X, \Theta^{(i-1)}\right] = \int_{y \in \Upsilon} \log p(X, y|\Theta)\, f(y|X, \Theta^{(i-1)})\, dy \tag{4.5}$$
$\Theta^{(i-1)}$ are the estimates from the last run of the EM algorithm. Based on $\Theta^{(i-1)}$ we try
to find the current estimates $\Theta$ that optimize the Q function. $\Upsilon$ is the space of possible
values of $y$.
As mentioned above, we can think of $Y$ as a random variable governed by some distribution
$f_Y(y)$, and of $h(\Theta, Y)$ as a function of two variables. The expectation can then be rewritten
in the form:
$$Q(\Theta) = E_Y[h(\Theta, Y)] = \int_y h(\Theta, y)\, f_Y(y)\, dy \tag{4.6}$$
In the M-step of the EM algorithm, we find the parameters $\Theta$ that maximize the Q
function:
$$\Theta^{(i)} = \arg\max_{\Theta} Q(\Theta, \Theta^{(i-1)}) \tag{4.7}$$
The two steps are repeated until the probability of the overall observation no longer
improves significantly. The EM algorithm is guaranteed to increase the log-likelihood at
each iteration and converges to a local maximum of the likelihood; the convergence is
guaranteed by the EM Theorem [Jel97].
4.3 Rewriting the Q Function
An assumption that is not explicitly stated in the derivation of the EM algorithm in all
the works cited above is that, in calculating the Q function, each data sample is taken to
be of equal importance in the estimation of the new parameters. This is reasonable in
problems like speech and visual signal recognition, because the order of the data in the
observed sample is not that important; rather, the general picture consisting of all data
points is of more interest. However, as stated at the beginning of this chapter, financial
time series have characteristics that make the time dimension very important. As time
goes on, old patterns attenuate and new patterns emerge. If the local data set that we use
as the training set is very large, newly emerged patterns or trends will inevitably be diluted.
Therefore we need a new EM algorithm that gives more importance to more recent data.
To achieve this goal, we must first rewrite the Q function. Recall that the Q function is
the expectation of the complete-data likelihood function $L$ with respect to the underlying
data $Y$, given the observed data $X$ and the last parameter estimates $\Theta$. We now include a
time-dependent weight $\eta$ in the expectation to reflect information on the time dimension.
The revised expectation of the likelihood function is:
$$\hat{Q}(\Theta, \Theta^{(i-1)}) = E\!\left[\log \eta\, p(X, Y|\Theta) \,\middle|\, X, \Theta^{(i-1)}\right] = \int_{y \in \Upsilon} \log \eta\, p(X, y|\Theta)\, f(y|X, \Theta^{(i-1)})\, dy \tag{4.8}$$
Since the logarithm function is monotonic and $0 < \eta \le 1$, we can take $\eta$ out of the log
function:
$$[\log \eta \cdot P(X, y|\Theta) \le \log \eta \cdot P(X, y|\Theta^g)] \iff [\eta \cdot P(X, y|\Theta) \le \eta \cdot P(X, y|\Theta^g)] \iff [\eta \cdot \log P(X, y|\Theta) \le \eta \cdot \log P(X, y|\Theta^g)] \tag{4.9}$$
This move brings great convenience to the E-step. The $\hat{Q}$ function then becomes:
$$\hat{Q}(\Theta, \Theta^{(i-1)}) = E\!\left[\eta \log p(X, Y|\Theta) \,\middle|\, X, \Theta^{(i-1)}\right] = \int_{y \in \Upsilon} \eta \log p(X, y|\Theta)\, f(y|X, \Theta^{(i-1)})\, dy \tag{4.10}$$
The weight $\eta$ is actually a vector of real values between 0 and 1: $\eta = \{\eta_1, \eta_2, \ldots, \eta_N\}$,
where $N$ is the number of data samples. The whole point is that the likelihood of each
sample is discounted by a confidence density. The weighted expectation function can be
understood intuitively: given the last parameter estimates and the observed data, if the
expected probability of observing one sample is 40%, then, given its position in the whole
data sequence, we might have only 60% confidence in the accuracy of this result, meaning
the final expected likelihood is only $0.4 \times 0.6 = 24\%$.
In a later section we show that the $\hat{Q}$ function converges to a local maximum, using
techniques from the EM Theorem.
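Although the weights $\eta$ enter the derivation formally in the next sections, one natural concrete choice, in the spirit of an exponential moving average, is a geometric decay toward the past. The decay factor `rho` below is a hypothetical illustration, not a value fixed by the thesis:

```python
import numpy as np

def exp_weights(n, rho=0.9):
    """Confidence weights eta_1..eta_N in (0, 1], increasing toward the most
    recent sample: eta_i = rho**(N - i), so the newest sample has weight 1."""
    return rho ** np.arange(n - 1, -1, -1)
```

Smaller `rho` discounts old samples more aggressively, which is the training-sensitivity knob the double weighted EM algorithm later adjusts automatically.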
4.4 Weighted EM Algorithm for Gaussian Mixtures
We now show how the parameters of the HMGM introduced in chapter 3 can be re-estimated
with the weighted EM algorithm. To begin with, we use a mixture of Gaussians as an
example to illustrate the derivation process; the HMGM is in essence only a special case of
the Gaussian mixture.
Under this problem setting, first assume the following density for a single sample $x_j$,
given the parameters $\Theta$ of the Gaussian mixture:
$$p(x_j|\Theta) = \sum_{i=1}^{M} \eta_j\, \alpha_i\, p_i(x_j|\theta_i) \tag{4.11}$$
The parameters now consist of three types of variables, $\Theta = (\eta_j, \alpha_1, \ldots, \alpha_M, \theta_1, \ldots, \theta_M)$.
$\eta_j$ is the time-dependent confidence weight for the $j$th observed sample. $\alpha_i$ indicates which
mixture component generates the observed data, and it must obey the constraint
$\sum_{i=1}^{M} \alpha_i = 1$. $p_i$ is the density function of the $i$th mixture component, with parameters $\theta_i$.
We assume a set of initial parameter values; for example, $y_i = k$ indicates that the $i$th
sample was generated by the $k$th mixture component. Also take $\theta_i$ as the parameters of
the $i$th Gaussian of the mixture ($\theta_i$ includes the mean and covariance of the $i$th Gaussian),
and let $\eta_n$ be the confidence weight of the $n$th sample. We can then obtain the complete-data
log-likelihood function based on these hypothesized parameters:
$$\log \hat{L}(\Theta|X, Y) = \eta \log P(X, Y|\Theta) = \sum_{i=1}^{N} \eta_i \log P(x_i|y_i) P(y_i) = \sum_{i=1}^{N} \eta_i \log \alpha_{y_i}\, p_{y_i}(x_i|\theta_{y_i}) \tag{4.12}$$
So we start from a set of guessed parameters, and from the assumed parameters we can go
on to compute the likelihood function. We guess that $\Theta^g = (\eta_1, \ldots, \eta_N;\ \alpha^g_1, \ldots, \alpha^g_M;\ \theta^g_1, \ldots, \theta^g_M)$
are the best possible parameters for the likelihood $\hat{L}(\Theta^g|X, Y)$. Note that the subscripts
of $\eta$ are in the time dimension, there being $N$ samples in the training set, while the
subscripts of $\alpha$ and $\theta$ are in the spatial dimension, indicating the parameters of a certain
mixture component. $\eta_i$ and $\alpha_{y_i}$ can be thought of as prior probabilities, and $p(y_i|x_i, \Theta^g)$
in the likelihood function can be computed using Bayes' rule:
$$p(y_i|x_i, \Theta^g) = \frac{\alpha^g_{y_i}\, p_{y_i}(x_i|\theta^g_{y_i})}{p(x_i|\Theta^g)} = \frac{\alpha^g_{y_i}\, p_{y_i}(x_i|\theta^g_{y_i})}{\sum_{k=1}^{M} \alpha^g_k\, p_k(x_i|\theta^g_k)} \tag{4.13}$$
and, as in the basic EM algorithm,
$$p(y|X, \Theta^g) = \prod_{i=1}^{N} p(y_i|x_i, \Theta^g) \tag{4.14}$$
where $y = (y_1, \ldots, y_N)$ retains the independence property.
As mentioned above, the confidence weight $\eta$ can be taken as a prior probability. The
quantity $p(y_i|x_i, \Theta^g)$ is in fact a conditional probability conditioned on the data position
$i$. However, this is often ignored in the traditional EM algorithm, because it is taken for
granted that the prior probability is always equal to 1.
With these guessed initial parameters $\Theta^g$, we can try to find the current parameters $\Theta$
that maximize the $\hat{Q}$ function:
$$\begin{aligned}
\hat{Q}(\Theta, \Theta^g) &= \sum_{y \in \Upsilon} \eta \log(L(\Theta|X, y))\, p(y|X, \Theta^g) \\
&= \sum_{y \in \Upsilon} \eta \log P(X, y|\Theta)\, p(y|X, \Theta^g) \\
&= \sum_{y \in \Upsilon} \sum_{i=1}^{N} \eta_i \log(\alpha_{y_i}\, p_{y_i}(x_i|\theta_{y_i})) \prod_{j=1}^{N} p(y_j|x_j, \Theta^g) \\
&= \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \sum_{i=1}^{N} \eta_i \log(\alpha_{y_i}\, p_{y_i}(x_i|\theta_{y_i})) \prod_{j=1}^{N} p(y_j|x_j, \Theta^g) \\
&= \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \sum_{i=1}^{N} \sum_{l=1}^{M} \delta_{l,y_i}\, \eta_i \log(\alpha_l\, p_l(x_i|\theta_l)) \prod_{j=1}^{N} p(y_j|x_j, \Theta^g) \\
&= \sum_{l=1}^{M} \sum_{i=1}^{N} \eta_i \log(\alpha_l\, p_l(x_i|\theta_l)) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \delta_{l,y_i} \prod_{j=1}^{N} p(y_j|x_j, \Theta^g)
\end{aligned} \tag{4.15}$$
The latter part of the above formula can be simplified as follows:
$$\begin{aligned}
&\sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \delta_{l,y_i} \prod_{j=1}^{N} p(y_j|x_j, \Theta^g) \\
&= \left(\sum_{y_1=1}^{M} \cdots \sum_{y_{i-1}=1}^{M} \sum_{y_{i+1}=1}^{M} \cdots \sum_{y_N=1}^{M} \prod_{j=1,\, j \ne i}^{N} p(y_j|x_j, \Theta^g)\right) p(l|x_i, \Theta^g) \\
&= \prod_{j=1,\, j \ne i}^{N} \left(\sum_{y_j=1}^{M} p(y_j|x_j, \Theta^g)\right) p(l|x_i, \Theta^g) \\
&= p(l|x_i, \Theta^g)
\end{aligned} \tag{4.16}$$
This simplification can be carried out because of the constraint $\sum_{i=1}^{M} p(i|x_j, \Theta^g) = 1$:
although the overall expectation is discounted by the time factor, the constraint on the
mixture components still holds at each data point over the time axis. The $\hat{Q}$ function can
then be written as:
$$\begin{aligned}
\hat{Q}(\Theta, \Theta^g) &= \sum_{l=1}^{M} \sum_{i=1}^{N} \eta_i \log(\alpha_l\, p_l(x_i|\theta_l))\, p(l|x_i, \Theta^g) \\
&= \sum_{l=1}^{M} \sum_{i=1}^{N} \eta_i \log(\alpha_l)\, p(l|x_i, \Theta^g) + \sum_{l=1}^{M} \sum_{i=1}^{N} \eta_i \log(p_l(x_i|\theta_l))\, p(l|x_i, \Theta^g)
\end{aligned} \tag{4.17}$$
It is clear that to maximize the $\hat{Q}$ function we must maximize the two terms in the
above formula separately. The first term contains $\alpha_l$, which must obey the constraint
$\sum_l \alpha_l = 1$, so it is a constrained optimization problem to which the Lagrange multiplier
$\lambda$ can be applied. The maximum is achieved when the following equation is satisfied:
$$\frac{\partial}{\partial \alpha_l}\left[\sum_{l=1}^{M} \sum_{i=1}^{N} \eta_i \log(\alpha_l)\, p(l|x_i, \Theta^g) + \lambda\left(\sum_l \alpha_l - 1\right)\right] = 0 \tag{4.18}$$
From above, we have:
$$
\sum_{i=1}^{N} \frac{1}{\alpha_l}\,\eta_i\, p(l|x_i,\Theta^g) + \lambda = 0
\qquad (4.19)
$$
Summing both the numerator and denominator of the ﬁrst term over l, the left side of the
equation becomes:
$$
\frac{\sum_{l=1}^{M}\sum_{i=1}^{N} \eta_i\, p(l|x_i,\Theta^g)}{\sum_{l=1}^{M}\alpha_l} + \lambda
= \frac{\sum_{i=1}^{N}\sum_{l=1}^{M} \eta_i\, p(l|x_i,\Theta^g)}{\sum_{l=1}^{M}\alpha_l} + \lambda
\qquad (4.20)
$$
$$
= \sum_{i=1}^{N} \eta_i + \lambda
\qquad (4.21)
$$
Setting the above expression equal to 0, we easily get $\lambda = -\sum_{i=1}^{N}\eta_i$. Substituting
$\lambda$ into the above equation, we finally get the estimate of $\alpha_l$:
$$
\alpha_l = \frac{\sum_{i=1}^{N} \eta_i\, p(l|x_i,\Theta^g)}{\sum_{i=1}^{N} \eta_i}
\qquad (4.22)
$$
Next we derive the update rules for the mean and covariance of the Gaussian components.
This is done by maximizing the second term in the $\hat{Q}$ function. Given the form of a
$D$-dimensional Gaussian distribution, we know
$$
p_l(x|\mu_l,\Sigma_l) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_l|^{1/2}}
\exp\left(-\frac{1}{2}(x-\mu_l)^{T}\Sigma_l^{-1}(x-\mu_l)\right)
\qquad (4.23)
$$
Substituting this distribution into the second term of (4.17) and dropping constant
terms that will be eliminated after taking derivatives, we get:
$$
\begin{aligned}
&\sum_{l=1}^{M}\sum_{i=1}^{N} \eta_i \log\bigl(p_l(x_i|\theta_l)\bigr)\, p(l|x_i,\Theta^g) \\
&\quad= \sum_{l=1}^{M}\sum_{i=1}^{N} \eta_i \left(-\frac{1}{2}\log(|\Sigma_l|)
- \frac{1}{2}(x_i-\mu_l)^{T}\Sigma_l^{-1}(x_i-\mu_l)\right) p(l|x_i,\Theta^g)
\end{aligned}
\qquad (4.24)
$$
The maximization of the above expression is achieved by taking the derivative with
respect to $\mu_l$ and setting it equal to 0:
$$
\sum_{i=1}^{N} \eta_i\, \Sigma_l^{-1}(x_i-\mu_l)\, p(l|x_i,\Theta^g) = 0
\qquad (4.25)
$$
Then we can work out the current estimate for $\mu_l$:
$$
\mu_l = \frac{\sum_{i=1}^{N} \eta_i\, x_i\, p(l|x_i,\Theta^g)}{\sum_{i=1}^{N} \eta_i\, p(l|x_i,\Theta^g)}
\qquad (4.26)
$$
Similarly, taking the derivative of (4.24) with respect to the covariance $\Sigma_l$ and setting it
equal to 0, we get:
$$
\sum_{i=1}^{N} \eta_i\, p(l|x_i,\Theta^g)\left(\Sigma_l - (x_i-\mu_l)(x_i-\mu_l)^{T}\right) = 0
\qquad (4.27)
$$
So the estimate for $\Sigma_l$ is:
$$
\Sigma_l = \frac{\sum_{i=1}^{N} \eta_i\, p(l|x_i,\Theta^g)\,(x_i-\mu_l)(x_i-\mu_l)^{T}}{\sum_{i=1}^{N} \eta_i\, p(l|x_i,\Theta^g)}
\qquad (4.28)
$$
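To make the M-step concrete, the estimates (4.22), (4.26), and (4.28) can be sketched in plain Python for the one-dimensional case. This is an illustrative sketch, not the thesis implementation: the function name and the input layout (a list of observations, a table of posteriors p(l|x_i, Θ^g), and a list of weights η_i) are our own assumptions.

```python
def weighted_m_step(x, post, eta):
    """One M-step of the exponentially weighted EM algorithm for a
    mixture of 1-D Gaussians, following equations (4.22), (4.26),
    and (4.28).

    x    : list of N scalar observations x_i
    post : post[l][i] = p(l | x_i, Theta_g), the posterior of mixture
           component l for observation i under the current guess
    eta  : list of N exponential time weights eta_i
    """
    M, N = len(post), len(x)
    total_eta = sum(eta)
    alphas, mus, variances = [], [], []
    for l in range(M):
        w = [eta[i] * post[l][i] for i in range(N)]   # eta_i * p(l|x_i, Theta_g)
        sw = sum(w)
        mu = sum(w[i] * x[i] for i in range(N)) / sw              # eq. (4.26)
        alphas.append(sw / total_eta)                             # eq. (4.22)
        mus.append(sum(w[i] * (x[i] - mu) ** 2
                       for i in range(N)) / sw and mu or mu)      # keep mu
        mus[-1] = mu
        variances.append(sum(w[i] * (x[i] - mu) ** 2
                             for i in range(N)) / sw)             # eq. (4.28)
    return alphas, mus, variances
```

With uniform weights η_i = 1 this reduces to the classic EM M-step, which is a quick sanity check: the mixture weights returned always sum to 1, and for a single component the estimates collapse to the sample mean and (population) variance.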
4.5 HMGM Parameter Re-estimation with the Weighted EM Algorithm
After showing the derivation of the exponentially weighted EM algorithm for the Gaussian
mixture case, we now return to our HMGM model. As discussed in chapters 2 and 3, we
have two intermediate variables describing the whole process of generating the observed
data: $\gamma_i(t)$ represents the probability of being in state $i$ at time $t$, and $w_{im}$ is the probability
of picking the $m$-th mixture component at state $i$. These intermediate variables combine
to generate the probability of the observed data. We can define the probability
$\delta_{im}(t)$ that the $m$-th mixture density generates observation $o_t$ as:
$$
\delta_{im}(t) = \gamma_i(t)\, \frac{w_{im}\, b_{im}(o_t)}{b_i(o_t)}
\qquad (4.29)
$$
We find that $\delta_{im}(t)$ enjoys the same status as the prior probability $p(l|x_i,\Theta^g)$ in the previous
discussion. The calculation of the transition probability $a_{ij}$ also involves a time variable
and thus can also be weighted. So the exponentially weighted update rules for the HMGM are
derived in a similar form, taking the time-dependent weight into consideration:
$$
\mu_{im} = \frac{\sum_{t=1}^{T} \eta_t\, \delta_{im}(t)\, o_t}{\sum_{t=1}^{T} \eta_t\, \delta_{im}(t)}
\qquad (4.30)
$$
$$
\sigma_{im} = \frac{\sum_{t=1}^{T} \eta_t\, \delta_{im}(t)\,(o_t-\mu_{im})(o_t-\mu_{im})}{\sum_{t=1}^{T} \eta_t\, \delta_{im}(t)}
\qquad (4.31)
$$
$$
w_{im} = \frac{\sum_{t=1}^{T} \eta_t\, \delta_{im}(t)}{\sum_{t=1}^{T} \eta_t\, \delta_i(t)}
\qquad (4.32)
$$
$$
a_{ij} = \frac{\sum_{t=1}^{T} \eta_t\, \alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)}{\sum_{t=1}^{T} \eta_t\, \alpha_i(t)\, \beta_i(t)}
\qquad (4.33)
$$
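As an illustration of how (4.29) and (4.32) fit together, the following sketch re-estimates the mixture weights of a single state $i$, assuming scalar observations and that the state posteriors $\gamma_i(t)$ have already been computed by the forward-backward procedure. The function names and the 1-D Gaussian restriction are our assumptions, not the thesis code.

```python
import math

def gaussian_pdf(o, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(o - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def reestimate_mixture(obs, gamma_i, w_i, mus, variances, eta):
    """Weighted re-estimation of the mixture weights of one state i,
    combining equations (4.29) and (4.32).

    obs       : observations o_1..o_T (scalars)
    gamma_i   : gamma_i[t] = probability of being in state i at time t
    w_i       : current mixture weights w_im of state i
    mus, variances : Gaussian parameters of the mixtures of state i
    eta       : time weights eta_t
    """
    T, M = len(obs), len(w_i)
    delta = [[0.0] * T for _ in range(M)]
    for t in range(T):
        b_i = sum(w_i[m] * gaussian_pdf(obs[t], mus[m], variances[m])
                  for m in range(M))
        for m in range(M):
            # eq. (4.29): delta_im(t) = gamma_i(t) * w_im * b_im(o_t) / b_i(o_t)
            delta[m][t] = (gamma_i[t] * w_i[m]
                           * gaussian_pdf(obs[t], mus[m], variances[m]) / b_i)
    denom = sum(eta[t] * sum(delta[m][t] for m in range(M)) for t in range(T))
    # eq. (4.32): new mixture weights, normalized over the components
    return [sum(eta[t] * delta[m][t] for t in range(T)) / denom for m in range(M)]
```

By construction the new weights are positive and sum to 1, regardless of the choice of the time weights $\eta_t$.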
4.6 Calculation and Properties of the Exponential Moving Average
Before we go on to show how we can rewrite the re-estimation formulas with EMAs of the
intermediate variables, let us take a closer look at the calculation and properties of the EMA.

The EMA was originally invented as a technical analysis indicator that keeps track of
trend changes in financial time series. Short-term EMAs respond faster than long-term
EMAs to changes in the trend. When the curves of two EMAs with different lengths
cross each other, this usually signals a trend change. Technical analysts have developed
other indicators based on the EMA, such as the Moving Average Convergence Divergence
(MACD). Figure 4.1 shows the signals indicated by comparing the 30-day EMA and the
100-day EMA for an individual stock (Intel):
Take the variable $p_t(ij)$ for illustration. We denote by $\bar{p}_t(ij)$ the exponential moving
average of the original variable $p_t(ij)$ at time $t$. As its name implies, the EMA changes
as time moves on. The calculation of a five-day exponential moving average for this
intermediate variable consists of two steps.

First, calculate the arithmetic average of the first five days, and denote this average as
$\bar{p}_5(ij)$. Second, starting from the sixth day, the EMA of each day is:

$$
\bar{p}_t(ij) = \rho \times p_t(ij) + (1-\rho) \times \bar{p}_{t-1}(ij)
\qquad (4.34)
$$
Figure 4.1: When the 30-day moving average moves above the 100-day moving average, the trend
is considered bullish. When the 30-day moving average declines below the 100-day moving average,
the trend is considered bearish. (picture from stockcharts.com)
Note that in the above equation, $p_t(ij)$ is the actual intermediate variable value at time
$t$, while $\bar{p}_t(ij)$ is the exponential moving average of that variable at time $t$. We keep running
the second step until the last day of the training pool.
The EMA of the second intermediate variable $\delta_i(t)$, which stores the probability of being
at state $i$ at time $t$, can be calculated in a similar way:

$$
\bar{\delta}_i(t) = \rho \times \delta_i(t) + (1-\rho) \times \bar{\delta}_i(t-1)
\qquad (4.35)
$$
In equations (4.34) and (4.35), $\rho$ is the weight that we have been discussing. The value
of $\rho$ is given by the formula $\rho = \frac{2}{1+k}$, where $k$ is the number of days in the
moving window.
For example, if the size of the moving window is 5, which means we are calculating a
five-day EMA, then the value of $\rho$ would be:

$$
\rho = \frac{2}{1+5} = 0.3333\ldots
\qquad (4.36)
$$
So the last day accounts for one third of the ﬁnal value.
The 5-day EMA at the 20th day of the training pool is calculated by running equation
(4.34) until the 20th day:

$$
\begin{aligned}
\bar{p}_6(ij) &= p_6(ij)\times\rho + \bar{p}_5(ij)\times(1-\rho) \\
\bar{p}_7(ij) &= p_7(ij)\times\rho + \bar{p}_6(ij)\times(1-\rho) \\
&\;\;\vdots \\
\bar{p}_{20}(ij) &= p_{20}(ij)\times\rho + \bar{p}_{19}(ij)\times(1-\rho)
\end{aligned}
$$
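The two-step recipe above can be written as a short routine; a minimal sketch (the function name is ours):

```python
def ema(series, k):
    """k-day exponential moving average as described in section 4.6:
    seed with the arithmetic average of the first k values, then apply
    equation (4.34) with rho = 2 / (1 + k)."""
    rho = 2.0 / (1 + k)
    avg = sum(series[:k]) / k       # step 1: seed value, e.g. p_bar_5(ij)
    out = [avg]
    for p in series[k:]:            # step 2: p_bar_t = rho*p_t + (1-rho)*p_bar_{t-1}
        avg = rho * p + (1 - rho) * avg
        out.append(avg)
    return out                      # EMA values from day k to the last day
```

A quick check of the claim that "the last day accounts for one third of the final value": with k = 5 and a series of five zeros followed by a single 1, the final EMA is exactly 1/3.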
As we can see from equation (4.34), the last day in the training window takes up
quite a large portion and notably affects the training procedure. As time goes on, the
importance of the early data drops exponentially, as shown in Fig 4.3. This is easily seen
when we go into the details of the calculation of equation (4.34):
$$
\begin{aligned}
\bar{p}_t(ij) &= \rho \times p_t(ij) + (1-\rho)\times\bar{p}_{t-1}(ij) \\
&= \rho \times p_t(ij) + (1-\rho)\times\bigl[\rho\times p_{t-1}(ij) + (1-\rho)\times\bar{p}_{t-2}(ij)\bigr] \\
&= \rho \times p_t(ij) + \rho(1-\rho)\times p_{t-1}(ij) + (1-\rho)^2\times\bar{p}_{t-2}(ij)
\end{aligned}
\qquad (4.37)
$$
Figure 4.2: The portion of each of the first 19 days' EMA in calculating the EMA for the 20th day.
As is shown in the figure, the EMA at the 19th day accounts for 2/3 of the EMA at day 20.
Now we see from (4.37) that, in calculating the EMA at day $t$, the EMA at day
$t-1$ takes up a portion of $(1-\rho)$, while the EMA at day $t-2$ takes up only $(1-\rho)^2$.
We can infer that day $t-3$ will have a portion of $(1-\rho)^3$ in the EMA at day
$t$. Fig 4.3 gives a graphical view of the portion each day takes in the calculation of the EMA.
4.7 HMGM Parameter Re-estimation in a Form of EMA
In the previous sections we derived the exponentially weighted estimation rules for the parameters
of the HMGM, showing the formulas with the exponential discount weight $\eta$. However,
there is still one problem regarding $\eta$: how is the time-dependent variable $\eta$ defined? In fact,
$\eta$ can be taken as a density function whose distribution is prior knowledge. Our task now
is to find an expression of time for $\eta$.
A nice property of the exponentially weighted intermediate parameters is that, while the
largest part of the weight is given to the recent data, the information from the early data
is not lost. The old data are also included in the final average, with a discounted weight.
The older the data, the less important they become, and the smaller the portion they take
up in the average. But their impact never disappears.

Figure 4.3: The portion of each of the 20 days' actual data in calculating the EMA for the 20th
day. The last day accounts for 1/3 of the EMA at day 20. The data earlier than day 10 take up a
very small portion but are still included in the overall calculation.
Giving more weight to recent data makes the HMM more sensitive to trend changes
in the time series. The HMM will ride on a new trend quickly, without making too
many mistakes at the beginning of the new trend before it adapts to it. That is
one important motivation for replacing the arithmetic average with the EMA in the EM update
rule.
Originally, the EMA was invented as a technical analysis tool that deals directly with financial
time series data, such as prices, volume, etc. In the EM algorithm for HMM
parameter re-estimation, there is a set of intermediate variables: $\alpha_i(t)$, $\beta_i(t)$, $\delta_{im}(t)$, $\delta_i(t)$. These
intermediate variables all have actual meanings: they indicate the probabilities of certain
events at a certain point in time. So these model variables are also related to the time series
data. This is another motivation for rewriting the re-estimation formulas in the form of EMAs
of the model variables. Take the denominator of the $\mu$ re-estimation as an example. From
$\sum_{t=1}^{T}\eta_t\,\delta_{im}(t)$ we can see that each $\delta_{im}(t)$ at a time point $t$ is associated with one value of $\eta_t$.
The sum can be expanded as:
$$
\sum_{t=1}^{T} \eta_t\, \delta_{im}(t)
= \eta_1\delta_{im}(1) + \eta_2\delta_{im}(2) + \cdots + \eta_{T-1}\delta_{im}(T-1) + \eta_T\delta_{im}(T)
\qquad (4.38)
$$
Let $\rho = \frac{2}{1+k}$, where $k$ is the multiplier in a $k$-day EMA. Then set $\eta_t$ by the following rules:

1. If $1 \leq t \leq k$, $\eta_t = (1-\rho)^{T-k} \cdot \frac{1}{k}$;

2. If $k < t < T$, $\eta_t = \rho \cdot (1-\rho)^{T-t}$;

3. If $t = T$, $\eta_t = \rho$.
Let $\bar{\delta}_{im}(t)$ represent the $k$-day EMA of $\delta_{im}$ at time $t$. Based on these rules, expression
(4.38) is turned into a form of the EMA of the intermediate variable $\delta_{im}(t)$:
$$
\begin{aligned}
&\eta_1\delta_{im}(1) + \eta_2\delta_{im}(2) + \cdots + \eta_{T-1}\delta_{im}(T-1) + \eta_T\delta_{im}(T) \\
&\quad= (1-\rho)^{T-k}\cdot\tfrac{1}{k}\cdot\delta_{im}(1) + \cdots + (1-\rho)^{T-k}\cdot\tfrac{1}{k}\cdot\delta_{im}(k) \\
&\qquad+ \rho\cdot(1-\rho)^{T-(k+1)}\cdot\delta_{im}(k+1) + \cdots
+ \rho\cdot(1-\rho)\cdot\delta_{im}(T-1) + \rho\cdot\delta_{im}(T) \\
&\quad= (1-\rho)^{T-k}\cdot\bar{\delta}_{im}(k) + \rho\cdot(1-\rho)^{T-(k+1)}\cdot\delta_{im}(k+1) \\
&\qquad+ \rho\cdot(1-\rho)^{T-(k+2)}\cdot\delta_{im}(k+2) + \cdots
+ \rho\cdot(1-\rho)\cdot\delta_{im}(T-1) + \rho\cdot\delta_{im}(T)
\end{aligned}
\qquad (4.39)
$$
The first $k$ terms work together to generate the EMA for the first $k$ periods. Then, from
$k+1$ to $T$, applying formula (4.35) to $\delta_{im}$ and using induction on equation (4.39), we get the
following result:

$$
(1-\rho)\cdot\bigl[\rho\cdot\delta_{im}(T-1) + (1-\rho)\cdot\bar{\delta}_{im}(T-2)\bigr] + \rho\cdot\delta_{im}(T)
= (1-\rho)\cdot\bar{\delta}_{im}(T-1) + \rho\cdot\delta_{im}(T)
= \bar{\delta}_{im}(T)
\qquad (4.40)
$$
Therefore, the re-estimation formulas finally come into a form of EMAs of the intermediate
variables.
$$
\mu_{im} = \frac{\sum_{t=1}^{T}\eta_t\,\delta_{im}(t)\,o_t}{\bar{\delta}_{im}(T)}
\qquad (4.41)
$$
$$
\sigma_{im} = \frac{\sum_{t=1}^{T}\eta_t\,\delta_{im}(t)\,(o_t-\mu_{im})(o_t-\mu_{im})}{\bar{\delta}_{im}(T)}
\qquad (4.42)
$$
$$
w_{im} = \frac{\bar{\delta}_{im}(T)}{\bar{\delta}_i(T)}
\qquad (4.43)
$$
$$
a_{ij} = \frac{\sum_{t=1}^{T}\eta_t\,\alpha_i(t)\,a_{ij}\,b_j(o_{t+1})\,\beta_j(t+1)}{\overline{\alpha_i\beta_i}(T)}
\qquad (4.44)
$$
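The three rules for $\eta_t$ can be checked numerically: with $\eta_t$ defined this way, the weighted sum (4.38) reproduces the recursively computed $k$-day EMA exactly, and the weights sum to 1. A small self-contained sketch (our own function names):

```python
def eta_weights(T, k):
    """Time weights eta_t defined by the three rules of section 4.7,
    with rho = 2 / (1 + k).  Returns [eta_1, ..., eta_T]."""
    rho = 2.0 / (1 + k)
    eta = []
    for t in range(1, T + 1):
        if t <= k:
            eta.append((1 - rho) ** (T - k) / k)      # rule 1
        elif t < T:
            eta.append(rho * (1 - rho) ** (T - t))    # rule 2
        else:
            eta.append(rho)                           # rule 3
    return eta

def ema_recursive(series, k):
    """Recursive k-day EMA of the whole series (equation 4.35):
    seed with the mean of the first k values, then update."""
    rho = 2.0 / (1 + k)
    avg = sum(series[:k]) / k
    for x in series[k:]:
        avg = rho * x + (1 - rho) * avg
    return avg
```

For any series of length T > k, `sum(eta_t * delta_t)` agrees with `ema_recursive` to floating-point precision, which is exactly the identity used to justify the EMA forms (4.41)–(4.44).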
4.8 The EM Theorem and Convergence Property

This section shows that our exponentially weighted EM algorithm is guaranteed to converge
to a local maximum, by a revised version of the EM Theorem. For a review of the EM Theorem,
please refer to [Jel97].
The EM Theorem for the exponentially weighted EM algorithm: Let $X$ and $Y$ be
random variables with distributions $\hat{P}(X)$ and $\hat{P}(Y)$ under some model $\hat{\Theta}$, and let
$\hat{P}'(X)$ and $\hat{P}'(Y)$ be the distributions under a model $\hat{\Theta}'$ after running the exponentially weighted
EM algorithm. Then,

$$
\left[\sum_Y \hat{P}(Y|X,\eta)\,\log\frac{\hat{P}'(Y,X|\eta)}{\hat{P}(Y,X|\eta)} \geq 0\right]
\;\Rightarrow\;
\hat{P}'(X|\eta) \geq \hat{P}(X|\eta)
\qquad (4.45)
$$
This theorem implies that for a model $\hat{\Theta}$ and an updated model $\hat{\Theta}'$, if the left-hand
inequality holds, then the observed data $X$ will be more probable under $\hat{\Theta}'$ than under $\hat{\Theta}$.
Proof: From the derivation of the exponentially weighted update rule we know

$$
0 \leq \sum_Y \hat{P}(Y|X,\eta) \leq 1
$$
And the logarithm is a monotonic function. So,

$$
\begin{aligned}
&\log\hat{P}'(X|\eta) \geq \log\hat{P}(X|\eta) \\
&\quad\Leftrightarrow\;
\sum_Y \hat{P}(Y|X,\eta)\,\log\hat{P}'(X|\eta) \geq \sum_Y \hat{P}(Y|X,\eta)\,\log\hat{P}(X|\eta) \\
&\quad\Leftrightarrow\;
\sum_Y \hat{P}(Y|X,\eta)\,\log\hat{P}'(X|\eta) - \sum_Y \hat{P}(Y|X,\eta)\,\log\hat{P}(X|\eta) \geq 0
\end{aligned}
\qquad (4.46)
$$
Then we can perform a series of logarithm transformations on expression (4.46):

$$
\begin{aligned}
&\sum_Y \hat{P}(Y|X,\eta)\,\log\hat{P}'(X|\eta) - \sum_Y \hat{P}(Y|X,\eta)\,\log\hat{P}(X|\eta) \\
&\quad= \sum_Y \hat{P}(Y|X,\eta)\,\log\frac{\hat{P}'(Y,X|\eta)}{\hat{P}'(Y|X,\eta)}
- \sum_Y \hat{P}(Y|X,\eta)\,\log\frac{\hat{P}(Y,X|\eta)}{\hat{P}(Y|X,\eta)} \\
&\quad= \sum_Y \hat{P}(Y|X,\eta)\,\log\frac{\hat{P}'(Y,X|\eta)}{\hat{P}'(Y|X,\eta)}
+ \sum_Y \hat{P}(Y|X,\eta)\,\log\frac{\hat{P}(Y|X,\eta)}{\hat{P}(Y,X|\eta)} \\
&\quad= \sum_Y \hat{P}(Y|X,\eta)\,\log\frac{\hat{P}'(Y,X|\eta)}{\hat{P}(Y,X|\eta)}
+ \sum_Y \hat{P}(Y|X,\eta)\,\log\frac{\hat{P}(Y|X,\eta)}{\hat{P}'(Y|X,\eta)} \\
&\quad\geq \sum_Y \hat{P}(Y|X,\eta)\,\log\frac{\hat{P}'(Y,X|\eta)}{\hat{P}(Y,X|\eta)} \geq 0
\end{aligned}
\qquad (4.47)
$$

Q.E.D.
The last step follows from Jensen's inequality, which basically states that misestimating
the distribution governing a random variable will eventually result in increased
uncertainty/information. For a thorough review of Jensen's inequality, please refer to
[HLP88] and [GI00].
Figure 4.4: The prediction falls behind the market during highly volatile periods of time. This is
because the system is too sensitive to recent changes.
4.9 The Double Weighted EM Algorithm
There is still room for improvement over the previous exponentially weighted EM rule. However,
as with any improvement, we have to consider the cost of the increased sensitivity and find
a way to overcome any problems it causes.
4.9.1 Problem with the Exponentially Weighted Algorithm

The exponentially weighted EM algorithm quickly picks up the new trend when there is
a turning of the trend. However, sometimes, especially during highly volatile periods,
the direction of the financial time series reverses at a very high frequency within a very short
period of time. In these cases, there is a danger of falling right behind the market movement,
as shown in Fig 4.4.

Fig 4.4 shows a problem caused by high sensitivity. Not only are the turning points of the
market all missed, but the predicted turning points are not the actual market turning
points. This is because the system is so sensitive to recent data that it falls into the pitfall
of high volatility. In fact, if one kept predicting the same direction, or referred
to the long-term trend during that period of time, one might have a better chance
of a higher gain.
4.9.2 Double Exponentially Weighted Algorithm

To solve this problem, we arm the HMM with a self-confidence adjustment capability.
We multiply the weight for the most recent data, $\eta$, by a second weight $\zeta$, the confidence
rate for the weight $\eta$. It is calculated by comparing the trend signal (i.e., the positive and
negative symbols) of the S&P 500 Index with the trend signal of the Dow Jones Index. The
Dow Jones is a relatively mild index compared to the S&P 500 and NASDAQ. We could of
course use other indicators that preserve the long-term trend well, but for the time being we
use the Dow for our experiment.

We compare the S&P 500 with the Dow over the past five days. If they have the same
direction on every one of the past 5 days, the HMM gets full confidence, 1. Otherwise, the
confidence weight $\zeta$ is calculated as the total number of days on which the Dow coincides with the
S&P, divided by 5. Then we multiply the two weights together; this is the weight we use
for further calculation.
The adjustable weighted exponential averages for the two intermediate variables are:

$$
\begin{aligned}
\bar{p}_t(ij) &= \eta\times\zeta\times p_t(ij) + (1-\eta\times\zeta)\times\bar{p}_{t-1}(ij) \\
\bar{\delta}_i(t) &= \eta\times\zeta\times\delta_i(t) + (1-\eta\times\zeta)\times\bar{\delta}_i(t-1)
\end{aligned}
\qquad (4.48)
$$
Then apply formulas (4.13)–(4.16) to get the updated parameters.
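A minimal sketch of the confidence adjustment described in this section, under the assumption that the trend signals are given as sequences of +1/−1 values; the function names are illustrative, not from the thesis:

```python
def confidence_weight(sp_signs, dow_signs):
    """Confidence rate zeta of section 4.9.2: the fraction of the last
    five days on which the S&P 500 and Dow Jones trend signals agree.
    Full agreement over 5 days gives zeta = 1."""
    agree = sum(1 for s, d in zip(sp_signs[-5:], dow_signs[-5:]) if s == d)
    return agree / 5.0

def double_weighted_update(prev_ema, current, eta, zeta):
    """One step of the double weighted average of equation (4.48):
    the effective weight is the product eta * zeta."""
    w = eta * zeta
    return w * current + (1 - w) * prev_ema
```

When the two indices agree on every day, the update collapses to the single exponentially weighted rule (4.34); disagreement shrinks the effective weight and pulls the model back toward the longer-term average.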
4.10 Chapter Summary

In this chapter we derived a novel exponentially weighted EM algorithm for training HMGMs.
The EWEM algorithm takes the time factor into consideration by discounting the likelihood
function of traditional EM with an exponential attenuation weight. This is very important in
financial time series analysis and prediction, because financial time series are known to be
non-stationary and different periods should be of different interest when generating a prediction
for the future. We show that the re-estimation of the parameters comes in a form of
exponential moving averages of the intermediate variables used for calculating the final
parameters. We prove that convergence to a local maximum is guaranteed by an exponential
version of the EM Theorem. To find a better balance between long-term and short-term
trends, we develop a double weighted EM algorithm.
Chapter 5
Experimental Results
This chapter presents all the experimental results. The first section shows the prediction
results on a synthetic data set. The second section includes the predictions for the years
1998 to 2000 of the NASDAQ Composite Index with a recurrent neural network. The
third section includes prediction results of the HMM with Gaussian mixtures as emission
density functions, trained with the simple weighted EM algorithm; we test a set of data
with different training windows. The last section shows the results from the weighted EM
algorithm.
5.1 Experiments with Synthetic Data Set
We first test the different models on an artificially generated data set. The data are generated
by a function composed of three parts:

$$
f(t) = 0.2 \times (\tau(t)/100 - 0.005) + 0.7 \times f(t-1) + 0.1 \times 0.00001
\qquad (5.1)
$$

$\tau(t)$ is a random number that accounts for 20% of the overall output. 70% of the data at step
$t$ comes from that of step $t-1$; this is intended to retain the impact of the recent past. And lastly
there is a 10% upward trend at a growth rate of 0.00001. Since 20% of the data are
generated randomly, this may not exactly match real-life financial time series, but
our model should at least be able to handle it before being applied to real data.
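Equation (5.1) can be reproduced with a short generator. We read $\tau(t)$ as a uniform random number in $[0, 1)$, which is our interpretation of "a random number" in the text; the function name and seed handling are ours.

```python
import random

def synthetic_series(n, seed=0):
    """Generate n steps of the synthetic test series of equation (5.1):
    20% zero-centred noise, 70% carried over from the previous step,
    plus a small constant upward drift.  tau(t) is assumed uniform
    on [0, 1), which is our reading of the thesis description."""
    rng = random.Random(seed)
    f = [0.0]
    for _ in range(n):
        noise = 0.2 * (rng.random() / 100.0 - 0.005)    # 20% random part
        f.append(noise + 0.7 * f[-1] + 0.1 * 0.00001)   # 70% memory + drift
    return f[1:]
```

With these constants the series stays within a few thousandths of zero, so in practice it is interpreted as a daily-return-like signal rather than a price level.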
We test the recurrent neural network and the HMGM on this artificial data set for
400 days. Both models predict the signal 5 days ahead. Results show that over the test
period of 395 days (prediction starts on the 6th day), the HMGM performs better than
the recurrent neural network by 10.55%, and 22.2% above the cumulative return based on
the artificially generated data. The HMGM underperforms a follow-the-trend strategy, which
takes yesterday's signal as today's prediction; this is because the data are composed in
such a way that a large portion of the data is taken directly from the recent past. However, it
is clearly shown that the HMGM is the better of the two statistical learning models,
and it works better than the buy-and-hold strategy (which is the index). In the following
sections we will test the HMGM on real-life data, the S&P 500 Index.

Figure 5.1: Experimental results on computer generated data.
The following table shows the comparison of accuracy between the recurrent network
and the exponentially weighted HMGM. The accuracy rates are based on statistics from
ten runs of both models on ten data sets generated by equation 5.1. As we can see, the
exponentially weighted HMGM (EW-HMGM) outperforms the recurrent network on positive
signals and is slightly behind on negative ones, but has a better overall performance. The
EW-HMGM also has less volatility in prediction accuracy across runs of the experiment.
               Positive         Negative         Overall
               Min/Max/Avg      Min/Max/Avg      Min/Max/Avg
Recurrent Net  0.44/0.72/0.55   0.52/0.79/0.62   0.53/0.65/0.58
EW-HMGM        0.54/0.80/0.64   0.54/0.68/0.57   0.54/0.64/0.61

Table 5.1: Comparison of the recurrent network and the exponentially weighted HMGM on the
synthetic data testing. The EW-HMGM outperforms the recurrent network in the overall result,
and has less volatility.
5.2 Recurrent Neural Network Results
The NASDAQ Index data are separated into three years. As can be seen from the data,
the recurrent neural network has a positive bias toward steady trends. By that we mean the
recurrent network performs better during a long-term bull market (as in late 1998 and early
1999) or a long-term bear market (as in the year 2000).
Figure 5.2: Prediction of the NASDAQ Index in 1998 with a recurrent neural network.
Figure 5.3: Prediction of the NASDAQ Index in 1999 with a recurrent neural network.

Figure 5.4: Prediction of the NASDAQ Index in 2000 with a recurrent neural network.

As we can see from the above figures, recurrent neural networks work better during
low-volatility, steady trends, whether in a bull market or a bear market. This is especially
apparent in the prediction for the year 2000. Part of the reason is the memory
brought by the recurrent topology: the past data have too much impact on the current
training and prediction.
5.3 HMGM Experimental Results
We launch a series of experiments on the Hidden Markov Gaussian Mixtures (HMGM). The
training window introduced in chapter 3 is adjustable from 10 days to 100 days. The EM
algorithm is performed on the training data with different training windows. Results show
that the prediction performance changes as the training window size is adjusted. The best
result is achieved with the 40-day training window, and prediction becomes worse as the
training window size is changed in either direction.
Figure 5.5: Prediction of the 400-day S&P 500 Index starting from Jan 2, 1998. Training
window size = 10.
We test all the approaches mentioned in chapter 4 on 400 days of the S&P 500
Index starting from Jan 2, 1998.

First, take a look at the HMGM with Rabiner's arithmetic average update rule. From
t213 to t228, the actual market had an apparent reversal in direction, from a decline to an
increase. The simple HMGM was not able to recognize this in a short time, which caused a
loss in profit. But the overall cumulative profit at the end of the testing period is still 20%
above the market return.

Figure 5.6: Prediction of the 400-day S&P 500 Index starting from Jan 2, 1998. Training
window size = 20.
The multivariate HMGM takes a vector of {S&P daily return, Dow Jones daily return}
as input. We do not know the exact relationship between these two financial time series, but
the multivariate HMM uses the EM algorithm to find the best parameters that maximize
the probability of seeing both series. The multivariate HMGM generates a cumulative
return similar to the single-series HMGM, but has less volatility in the later part of the
testing data.
The exponentially weighted EM algorithm performs significantly better than the first
two models, mainly because it picks up a trend reversal much faster. During the periods
from t164 to t178 and from t213 to t228, the trend reversals were recognized and we gained
points there, so the later cumulative gains could build on a higher early gain. However, it also
loses points during periods of higher volatility, such as the period after t228. At
the end of the whole period, a bull market started, and the model responded quickly to this
signal and made no mistake during the long bull run. We made nearly twice as much by the end
of the 400 days.

Figure 5.7: Prediction of the 400-day S&P 500 Index starting from Jan 2, 1998. Training
window size = 40.
Then we tested the double weighted EM algorithm. There is not much difference
from the single exponentially weighted EM algorithm, partly because the S&P 500
and the Dow give identical direction signals most of the time during the testing period. But
a small difference at the end of the 400 days gives the double weighted algorithm a 2% better
performance.
Finally, we produce an all-in-one picture with the prediction results from all the
models for comparison. As seen from the picture, the two weighted HMGMs perform
convincingly better than the traditional EM algorithm.
5.4 Test on Data Sets from Different Time Periods

To show that the results generated by the HMGM are not random and can repeatedly beat the
buy-and-hold strategy, we test the model on data sets from different time periods. The
data tested are four segments of the S&P 500 Index: November 2, 1994 to June 3, 1996;
June 4, 1996 to December 31, 1998; August 5, 1999 to March 7, 2001; and March 8, 2001 to
October 11, 2002 (the period January 2, 1998 to August 4, 1999 was tested before). The
first two segments are from a bull market cycle, while the latter two segments are
from a bear market. Experiments show that the HMGM performs equally well under
both situations.

Figure 5.8: Prediction of the 400-day S&P 500 Index starting from Jan 2, 1998. Training
window size = 100.
5.5 Comparing the Sharpe Ratio with the Top Five S&P 500 Mutual Funds

The Sharpe Ratio, also called the reward-to-variability ratio, is a measure of the risk-adjusted
return of an investment. It was derived by W. Sharpe [Sha66]. Sharpe argues that the mean
and variance by themselves are not sufficient to measure the performance of a mutual fund;
he therefore combined the two variables into one and developed the Sharpe Ratio.
Figure 5.9: Prediction with HMGM for 400 days of the S&P 500 Index. Note from t213 to t228,
the HMM was not able to catch up with the change in trend swiftly.

Figure 5.10: Prediction with multivariate HMGM for 400 days of the S&P 500 Index.

Figure 5.11: Prediction with single exponentially weighted HMGM.

Figure 5.12: Prediction with double weighted HMGM that has a confidence adjustment variable.

Figure 5.13: Results from all models for comparison.

Figure 5.14: Results on the period from November 2, 1994 to June 3, 1996.

Figure 5.15: Results on the period from June 4, 1996 to December 31, 1998.

Figure 5.16: Results on the period from August 5, 1999 to March 7, 2001.

Figure 5.17: Results on the period from March 8, 2001 to October 11, 2002.

The definition of the Sharpe Ratio is quite straightforward. Let $R_{Ft}$ be the return on
the fund in period $t$, $R_{Bt}$ the return on the benchmark portfolio or security in period $t$, and
$D_t$ the differential return in period $t$:

$$
D_t = R_{Ft} - R_{Bt}
\qquad (5.2)
$$
Let $\bar{D}$ be the average value of $D_t$ over the historic period from $t = 1$ through $T$:

$$
\bar{D} = \frac{1}{T}\sum_{t=1}^{T} D_t
\qquad (5.3)
$$
and let $\sigma_D$ be the standard deviation over the period:

$$
\sigma_D = \sqrt{\frac{\sum_{t=1}^{T}(D_t-\bar{D})^2}{T}}
\qquad (5.4)
$$
The Sharpe Ratio ($S_h$) is:

$$
S_h = \frac{\bar{D}}{\sigma_D}
\qquad (5.5)
$$
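Equations (5.2)–(5.5) translate directly into a few lines of Python; a sketch with illustrative names, using the population standard deviation as in (5.4):

```python
import math

def sharpe_ratio(fund_returns, benchmark_returns):
    """Sharpe Ratio of equations (5.2)-(5.5): the mean differential
    return divided by its (population) standard deviation."""
    d = [f - b for f, b in zip(fund_returns, benchmark_returns)]  # eq. (5.2)
    t_len = len(d)
    d_bar = sum(d) / t_len                                        # eq. (5.3)
    sigma = math.sqrt(sum((x - d_bar) ** 2 for x in d) / t_len)   # eq. (5.4)
    return d_bar / sigma                                          # eq. (5.5)
```

Because the ratio is scale-free, multiplying all returns by a constant leaves it unchanged, which is why it permits comparing strategies "at the same risk" as discussed below.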
As we can see, the ratio indicates the average differential return per unit of historic
variability of the differential return. Using the Sharpe Ratio, we can compare the returns of
different strategies at the same risk. This is the meaning of "beating the market" mentioned in
chapter 1.

Symbol   Sharpe Ratio
BTIIX    0.02948
FSMKX    0.011475
SPFIX    0.000502
SWPPX    0.024323
VINIX    0.00142
HMGM     0.084145

Table 5.2: Comparison of the Sharpe Ratio between the HMM and the top 5 S&P 500 funds during
the period of August 5, 1999 to March 7, 2001. Four of the five funds have negative Sharpe Ratios.
The absolute value of the positive one is much smaller than the HMGM prediction.

Symbol   Sharpe Ratio
BTIIX    0.05465
FSMKX    0.003319
SPFIX    0.01856
SWPPX    0.01726
VINIX    0.001446
HMGM     0.098967

Table 5.3: Comparison of the Sharpe Ratios during the period of March 8, 2001 to October 11,
2002. The double weighted HMGM consistently outperforms the top five S&P 500 mutual funds.
The Sharpe Ratio can be used for significance testing in financial time series analysis
and prediction, because the t-statistic equals the Sharpe Ratio times the square
root of the total number of testing units. The Sharpe Ratio is thus proportional to the
t-statistic of the means.
We now compare the HMGM model we propose with the top 5 S&P 500 mutual funds
that performed best over the past three years. The HMGM outperforms all of them: 4
of the 5 funds have negative Sharpe Ratios, while the single positive one is much smaller
than that of the HMGM.
5.6 Comparisons with Previous Work

The Hidden Markov Model, as a sequence learning model, has been applied to many fields
including natural language processing, bioinformatics, and audio and video signal processing
(see 2.4). But there have been few applications of HMMs to financial time
series analysis and prediction. One reason is that it is usually hard to find functions
associated with the states that characterize the behavior of the time series; this is an
empirical question. For example, in [MCG04], a binomial density function at each state
is used to represent the risk of default.

A second reason lies in the EM algorithm. Financial time series differ from other sequence
data in that they are time dependent. In video, audio, language, and bioinformatics signal
processing, this dependence is not important, or the data must be treated as being of equal
importance. Patterns in financial time series tend to change over time, as we have shown in
the super five day pattern and the January Effect cases.
Bengio [BF96] introduced an Input-Output Hidden Markov Model (IOHMM), which has
been applied to electroencephalographic signal rhythms [CB04]. Their model enables the HMM
to generate the output probability based on the current state as well as the current input
data sample from the input sequence. But it still does not capture the importance of time:
the input sequence is just another sequence of data.
In the two weighted EM algorithms we propose in this thesis, the weight applied to the
input sequence is a time-dependent value that changes from time to time. The state
parameters in the re-estimation formulas all have actual meanings: they represent the
probabilities of a variety of situations happening in the HMM. Adding the time-dependent
weights incorporates temporal information into the training process of the HMM. Further,
the double weighted EM algorithm enables self-adjustment of the weights. These techniques
enrich the representational power of the HMM in dealing with financial time series data.
We now compare our experimental results with Shi and Weigend's work [SW97]. As described before, they used three neural networks (called experts) as the emission functions at each state, so there are three states in all. Each expert has its own expertise: the first is responsible for low-volatility regions, the second is good at high-volatility regions, and the third acts as a collector for outliers.
They test on 8 years of daily S&P 500 data from 1987 to 1994. Their return over the 8-year period is 120%. In our experiments, each of the five 400-day testing periods generates a return close to 100%.
However, the main problem with their prediction is not the rate of return. As we can see from the profit curve (Figure 5.18), the profit curve has a similar shape to the index curve. This means that too many positive signals are generated by the prediction system; in other words, much of the time they are holding a long position. The profit of the prediction thus follows the index behavior to a large extent.
Figure 5.18: Too many positive signals result in a curve similar to the index.
In all our five 400-day periods, the profit curves based on the prediction have a different shape from the index curve. This reflects a more sophisticated strategy that generates dynamically changing signals.
Chapter 6
Conclusion and Future Work
This chapter concludes the thesis. We summarize our findings from the experiments and indicate possible improvements to the tested models.
6.1 Conclusion
Our experiments fall into three categories: testing the HMGM with different training window sizes; testing the weighted EM algorithm in comparison with the classic EM algorithm; and using supporting time series to help predict the target series.
Compared with neural network experts, the parameters of the HMGM emission density functions can be derived cleanly from the EM algorithm. By using a mixture of Gaussian density functions at each state, the emission functions provide a broad range of predictions.
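Since the HMGM uses a Gaussian mixture as the emission density at each state, the density being fitted has the following generic form (a standard mixture density, shown here only for concreteness; the fitted weights, means, and standard deviations come from the EM training):

```python
import math

def gaussian_mixture_pdf(x, weights, means, stds):
    """Density of a mixture of univariate Gaussians: the weighted sum of
    the component densities, with weights summing to 1."""
    total = 0.0
    for w, m, s in zip(weights, means, stds):
        total += w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return total
```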
Results show that there is no incorrect bias of the kind seen in the neural network experiments. The HMGM performs equally well in bull and bear markets. Although it makes more mistakes at times when the financial time series is volatile, the HMGM eventually catches up with the trend. In an experiment with a 20-day training window, the cumulative rate of return is 20% above the actual market return over the 400 days of the training and testing period.
We introduced an important concept in the model design: the training window. In fact, the training window is used in both training and testing, as these two processes are combined throughout the experiment. What size the training window should be is an empirical question. Experimental results show that a 40-day training window generates the best prediction; the accuracy of prediction decreases as the size of the training window changes in either direction. This means that to get the best prediction, we have to balance short-term sensitivity against long-term memory.
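The combined training-and-testing process over a sliding window can be sketched as a walk-forward loop. Here `train_hmm` and `predict_next` are hypothetical stand-ins for the HMGM training and one-step prediction routines; only the loop structure is the point:

```python
def rolling_forecast(series, window, train_hmm, predict_next):
    """Walk forward through the series: at each step, retrain on the most
    recent `window` observations, then make a one-step-ahead prediction."""
    preds = []
    for t in range(window, len(series)):
        model = train_hmm(series[t - window:t])   # train on the last `window` days
        preds.append(predict_next(model))         # predict day t
    return preds
```

A larger `window` gives longer memory but slower reaction to regime changes, which is exactly the trade-off observed above.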
Noticing that recent data should be given more emphasis than older data, we revised the general EM algorithm into an exponentially weighted EM algorithm. This algorithm is motivated by the exponential moving average technique used widely in financial engineering. The weighted EM algorithm shows an increased sensitivity to recent data. However, during highly sensitive periods, the model might fall into the pitfall of frequent whipsaws, i.e., buy and short signals reversing in a very short time. To overcome this we add a second weight on top of the exponential weight. The second weight, which is calculated from the divergence of the S&P 500 and the Dow Jones Index, represents the market volatility. In a 20-day training window test on the same training data, the revised model generates more stable predictions and a higher rate of return: the end of the 400-day training and testing period shows a return of 195%.
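The exact construction of the second weight is developed in Chapter 4. Purely as an illustration of the idea, a rolling volatility proxy built from the divergence of two index series might look like this (a hypothetical sketch, not the thesis's formula):

```python
import numpy as np

def divergence_weight(sp500, djia, window=20):
    """Hypothetical volatility proxy: the rolling mean absolute divergence
    between the daily returns of two index series."""
    r1 = np.diff(sp500) / sp500[:-1]          # daily returns of series 1
    r2 = np.diff(djia) / djia[:-1]            # daily returns of series 2
    div = np.abs(r1 - r2)                     # per-day divergence
    # rolling mean over `window` days
    return np.convolve(div, np.ones(window) / window, mode="valid")
```

When the two indices move in lockstep the weight stays near zero; when they diverge, the proxy rises, signaling a volatile market.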
6.2 Future Work
Rather than using the predefined 20-day training window, the sizes of the training pool and training window could be decided by a hill climbing or dynamic programming approach. Identifying useful side information for prediction, such as developing new indicators, will help make better predictions. We could then extend the prediction to a form of density function rather than simple point values; this way we can output richer information about the market movements.
Which functions to use at each state, and how the functions are mixed at each state, are not studied in this thesis; this could be valuable future work. The Gaussian distribution is the most popular distribution for modeling random or unknown processes, but it has been observed that many time series may not follow a Gaussian distribution.
In this thesis, we discussed the application of the Hidden Markov Model to a single financial time series; more specifically, we studied the S&P 500 Composite Index. We did not take the portfolio selection problem into consideration. Since different financial time series have different characteristics, portfolios composed of different stocks will have different rates of return and risk levels; in other words, they will exhibit different patterns. It would be interesting to apply the HMM technique to a portfolio of financial assets and study the time series generated by such groups of assets. The Viterbi algorithm could play an important role in this application: if we place functions that are estimated to best characterize different assets at different states, then, by observing the price movements, the Viterbi path will be able to tell which assets should be included in the portfolio. [Cov91] offers a nice introduction to the universal portfolio selection problem.
The double weights in the HMGM model can also be trained. These weights decide the
sensitivity of the model, so by setting up a function that studies the relationship between
sensitivity and accuracy, we can expect to ﬁnd better weights.
The outputs of the HMGM models we present in this thesis are positive and negative symbols. A more refined model would output densities rather than symbols, i.e., the probability with which the market will go up or down. Furthermore, the density can be designed to describe the probability of falling into a range of values.
Appendix A
Terminology List
outlier Outliers are a few observations that are not well ﬁtted by the best available model.
Usually if there is no doubt about the accuracy or veracity of the observation, then it
should be removed, and the model should be reﬁtted.
whipsaw A whipsaw occurs when a buy or sell signal is reversed in a short time. Volatile
markets and sensitive indicators can cause whipsaws.
trend Refers to the direction of prices. Rising peaks and troughs constitute an uptrend;
falling peaks and troughs constitute a downtrend. A trading range is characterized by
horizontal peaks and troughs.
volatility Volatility is a measurement of change in price over a given period. It is usually
expressed as a percentage and computed as the annualized standard deviation of the
percentage change in daily price.
risk Same meaning as volatility. The more volatile a stock or market, the more risky the
stock is, and the more money an investor can gain (or lose) in a short time.
indicator An indicator is a value, usually derived from a stock’s price or volume, that an
investor can use to try to anticipate future price movements.
Sharpe Ratio Also called the rewardtorisk ratio, equals the potential reward divided by
the potential risk of a position.
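A minimal sketch of this computation, assuming periodic (e.g. daily) returns, a zero risk-free rate by default, and the common annualization factor of the square root of 252 trading days:

```python
import math

def sharpe_ratio(returns, periods_per_year=252, risk_free=0.0):
    """Annualized Sharpe ratio: mean excess return divided by its
    standard deviation, scaled by sqrt(periods per year)."""
    excess = [r - risk_free for r in returns]
    n = len(excess)
    mean = sum(excess) / n
    var = sum((r - mean) ** 2 for r in excess) / n
    sd = math.sqrt(var)
    return (mean / sd) * math.sqrt(periods_per_year)
```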
Exponential Moving Average A moving average (MA) is an average of data for a certain number of time periods. It moves because for each calculation we use the latest x time periods' data. By definition, a moving average lags the market. An exponentially smoothed moving average (EMA) gives greater weight to the more recent data, in an attempt to reduce the lag.
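The EMA just described can be written recursively; a minimal sketch using the common smoothing convention alpha = 2/(x + 1):

```python
def ema(values, x):
    """Exponential moving average over a span of x periods.
    alpha = 2 / (x + 1): recent data receive geometrically more weight."""
    alpha = 2.0 / (x + 1)
    out = [values[0]]                    # seed with the first observation
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out
```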
profit Profit is the cumulative rate of return of a financial time series. To calculate today's cumulative return, multiply today's actual gain by yesterday's cumulative return.
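The running-product definition above can be sketched as:

```python
def cumulative_returns(daily_gains):
    """Compound daily gross gains (e.g. 1.02 for a +2% day) into a
    cumulative return curve, starting from 1.0."""
    curve, total = [], 1.0
    for g in daily_gains:
        total *= g          # today's cumulative = today's gain * yesterday's cumulative
        curve.append(total)
    return curve
```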
short To short means to sell something that you don’t have. To do this, you have to borrow
the asset that you want to short. The reason to sell short is that you believe the price
of that asset will go down.
Bibliography
[ADB98] A. D. Back and A. S. Weigend. What Drives Stock Returns? An Independent Component Analysis. In Proceedings of the IEEE/IAFE/INFORMS 1998 Conference on Computational Intelligence for Financial Engineering, pages 141–156. IEEE, New York, 1998.
[APM02] A. Panuccio, M. Bicego, and V. Murino. A Hidden Markov Model-Based Approach to Sequential Data Clustering. In Structural, Syntactic, and Statistical Pattern Recognition, Joint IAPR International Workshops SSPR 2002 and SPR 2002, Windsor, Ontario, Canada, August 6–9, 2002, Proceedings, volume 2396 of Lecture Notes in Computer Science, pages 734–742. Springer, 2002.
[BF96] Y. Bengio and P. Frasconi. Input-Output HMMs for Sequence Processing. IEEE Transactions on Neural Networks, 7(5):1231–1249, 1996.
[Bil97] J. A. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. TR-97-021, International Computer Science Institute, UC Berkeley, 1997.
[Boy83] R. A. Boyles. On the Convergence Properties of the EM Algorithm. Journal of
the Royal Statistical Society, Ser B 44:47–50, 1983.
[BP66] L. E. Baum and T. Petrie. Statistical Inference for Probabilistic Functions of
Finite State Markov Chains. Annals of Mathematical Statistics, 37:1559–1563,
1966.
[BS73] F. Black and M. Scholes. The Pricing of Options and Corporate Liabilities.
Journal of Political Economy, 81:637, 1973.
[BWT85] W. F. M. De Bondt and R. H. Thaler. Does the Market Overreact? Journal of Finance, 40:793–805, 1985.
[CB04] S. Chiappa and S. Bengio. HMM and IOHMM modeling of EEG rhythms
for asynchronous BCI systems. European Symposium on Artiﬁcial Neural
Networks, ESANN, 2004.
[Cha93] E. Charniak. Statistical Language Learning. MIT Press, Cambridge,
Massachusetts, 1993.
[Chi95] G. Chierchia. Dynamics of Meaning: Anaphora, Presupposition and the Theory
of Grammar. The University of Chicago Press: Chicago, London, 1995.
[Cov91] T. Cover. Universal Portfolios. Mathematical Finance, 1(1):1–29, 1991.
[CWL96] W. Cheng, W. Wagner, and C. H. Lin. Forecasting the 30-year U.S. treasury bond with a system of neural networks. NeuroVe$t Journal, 1996.
[Deb94] G. Deboeck. Trading on the edge : neural, genetic, and fuzzy systems for chaotic
ﬁnancial markets. Wiley, New York, 1994.
[DLR77] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximumlikelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society,
Ser B 39:1–38, 1977.
[Dor96] G. Dorﬀner. Neural Networks for Time Series Processing. Neural Network
World, pages 447–468, 1996.
[Fam65] E. F. Fama. The Behaviour of Stock Market Prices. Journal of Business,
38:34–105, 1965.
[Fam70] E.F. Fama. Eﬃcient Capital Markets: A Review of Theory and Empirical
Work. Journal of Finance, 25:383–417, 1970.
[FST98] S. Fine, Y. Singer, and N. Tishby. The hierarchical Hidden Markov Model:
analysis and applications. Machine Learning, 32, 1998.
[Fur01] S. Furui. Digital speech processing, synthesis, and recognition. Marcel Dekker,
New York, 2001.
[GI00] I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series, and Products. Academic Press, San Diego, CA, 2000.
[GLT01] C. Lee Giles, S. Lawrence, and A.C. Tsoi. Noisy Time Series Prediction using
a Recurrent Neural Network and Grammatical Inference. Machine Learning,
44:161–183, 2001.
[GP86] C. W. J. Granger and P. Newbold. Forecasting Economic Time Series. Academic Press, San Diego, 1986.
[Gra86] B. Graham. The Intelligent Investor: The Classic Bestseller on Value Investing.
HarperBusiness, 1986.
[GS00] X. Ge and P. Smyth. Deformable Markov model templates for timeseries
pattern matching. In Proceedings of the sixth ACM SIGKDD international
conference on Knowledge discovery and data mining , pages 81–90. ACM Press
New York, NY, USA, 2000.
[HG94] J. P. Hughes and P. Guttorp. Incorporating spatial dependence and atmospheric data in a model of precipitation. Journal of Applied Meteorology, 33(12):1503–1515, 1994.
[HK01] J. Han and M. Kamber. Data mining: concepts and techniques. Morgan
Kaufmann, San Francisco, 2001.
[HLP88] G. H. Hardy, J. E. Littlewood, and G. Pólya. Inequalities. Cambridge University Press, Cambridge, England, 1988.
[Hul00] J. Hull. Futures, Options and Other Derivatives. PrenticeHall, Upper Saddle
River, NJ, 2000.
[Jel97] F. Jelinek. Statistical methods for speech recognition. MIT Press, Cambridge,
Mass.; London, 1997.
[JP98] T. Jebara and A. Pentland. Maximum conditional likelihood via bound maximization and the CEM algorithm. Advances in Neural Information Processing Systems, 10, 1998.
[MCG04] M. Crowder, M. Davis, and G. Giampieri. A Hidden Markov Model of Default Interaction. In Second International Conference on Credit Risk, 2004.
[Mit97] T. Mitchell. Machine Learning. McGraw Hill, New York ; London, 1997.
[MS99] C. D. Manning and H. Schütze. Chapter 9: Markov Models. In Foundations of Statistical Natural Language Processing, pages 317–379. The MIT Press, 1999.
[Nic99] A. Nicholas. Hidden Markov Mixtures of Experts for Prediction of Switching Dynamics. In Y. H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Proceedings of Neural Networks for Signal Processing IX: NNSP'99, pages 195–204. IEEE Publishing, 1999.
[PBM94] P. Baldi, Y. Chauvin, T. Hunkapiller, and M. A. McClure. Hidden Markov Models of Biological Primary Sequence Information. Proceedings of the National Academy of Sciences USA, 91(3):1059–1063, 1994.
[Rab89] L. R. Rabiner. A tutorial on hidden markov models and selected applications
in speech recognition. In Proceedings of IEEE, volume 77, pages 257–286, 1989.
[RIADC02] R. I. A. Davis, B. C. Lovell, and T. Caelli. Improved Estimation of Hidden Markov Model Parameters from Multiple Observation Sequences. In 16th International Conference on Pattern Recognition (ICPR'02), Quebec City, QC, Canada, volume 2, page 20168, 2002.
[RJ86] L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models.
IEEE ASSP Mag., June:4–16, 1986.
[Ros03] S. M. Ross. An Introduction to Mathematical Finance. Cambridge University
Press, Cambridge, New York, 2003.
[Row00] S. Roweis. Constrained Hidden Markov Models. Advances in Neural
Information Processing Systems, 12, 2000.
[RW84] R. Redner and H. Walker. Mixture densities, maximum likelihood and the EM
algorithm. SIAM Review, 26(2), 1984.
[RY98] E. Ristad and P. Yianilos. Towards EMstyle Algorithms for a posteriori
Optimization of Normal Mixtures. IEEE International Symposium on
Information Theory, 1998.
[Sha49] C. E. Shannon. Communication Theory of Secrecy Systems. Bell System
Technical Journal, 28(4):656–715, 1949.
[Sha66] W. F. Sharpe. Mutual Fund Performance. Journal of Business, 39:119–138,
1966.
[STW99] R. Sullivan, A. Timmermann, and H. White. Data-snooping, technical trading rule performance and the bootstrap. Journal of Finance, 54:1647–1691, 1999.
[SW97] S. Shi and A. S. Weigend. Taking Time Seriously: Hidden Markov Experts
Applied to Financial Engineering. In Proceedings of the IEEE/IAFE 1997
Conference on Computational Intelligence for Financial Engineering , pages
244–252. IEEE, 1997.
[Tha87] R. Thaler. Anomalies: The January Effect. Journal of Economic Perspectives, 1:197–201, 1987.
[Wac42] S. Wachtel. Certain Observations on Seasonal Movements in Stock Prices.
Journal of Business, 15:184–193, 1942.
[WG93] A. S. Weigend and N.A. Gershenfeld. Time Series Prediction: Forecasting the
Future and Understanding the past. AddisonWesley, 1993.
[WMS95] A. S. Weigend, M. Mangeas, and A. N. Srivastava. Nonlinear gated experts
for time series: Discovering regimes and avoiding overﬁtting. In International
Journal of Neural Systems, volume 6, pages 373–399, 1995.
[Wu83] C. F. J. Wu. On the Convergence Properties of the EM Algorithm. The Annals
of Statistics, 11:95–103, 1983.
[XJ96] L. Xu and M. Jordan. On Convergence Properties of the EM Algorithm for
Gaussian Mixtures. Neural Computation, 8(1):129–151, 1996.
[XLP00] X. Li, M. Parizeau, and R. Plamondon. Training Hidden Markov Models with Multiple Observations: A Combinatorial Method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4):371–377, 2000.
[ZG02] S. Zhong and J. Ghosh. HMMs and coupled HMMs for multi-channel EEG classification. Proceedings of the IEEE International Joint Conference on Neural Networks, 2:1254–1259, 2002.
To my mom

Acknowledgments

I would like to express my deep gratitude to my senior supervisor, Dr. Anoop Sarkar, for his great patience, detailed instructions and insightful comments at every stage of this thesis. What I have learned from him goes beyond the knowledge he taught me: his attitude toward research, carefulness and commitment has had, and will continue to have, a profound impact on my future academic activities.
I am also indebted to my co-senior supervisor, Dr. Andrey Pavlov, for trusting me before I realized my results. I still remember the day when I first presented to him the "super five day pattern" program. He is the source of the knowledge in finance that I used in this thesis. Without the countless time he spent with me, this thesis surely would not have been possible.
I cannot thank enough my supervisor, Dr. Oliver Schulte. Part of this thesis derives from a course project I did for his machine learning course, in which he introduced me to the world of machine learning and interdisciplinary research approaches.
Many of my fellow graduate students offered great help during my study and research at the computing science school. I thank Wei Luo for helping me with many issues regarding LaTeX and Unix; Cheng Lu and Chris Demwell for sharing their experiences with Hidden Markov Models; and my labmates at the Natural Language Lab for their comments at my practice talk and for creating such a nice environment in which to work. I would also like to thank Mr. Daniel Brimm for his comments from the point of view of a real-life business professional.
Finally I would like to thank my mom for all that she has done for me. Words cannot express one millionth of my gratitude to her, but time will prove the return of her immense love.
Contents

Approval; Abstract; Dedication; Acknowledgments; Contents; List of Tables; List of Figures
1 Introduction: Financial Time Series; Return of Financial Time Series; The Standard & Poor 500 data; Traditional Techniques and Their Limitations; Non-efficiency of Financial Markets; Objective; Contributions of the Thesis; Thesis Organization; Notation
2 Hidden Markov Models: Markov Models; A Markov chain of stock market movement; Hidden Markov Models; Why HMM?; General form of an HMM; Discrete and continuous observation probability distributions; Three Fundamental Problems for HMMs (finding the probability of an observation; finding the best state sequence; parameter estimation); Chapter Summary
3 HMM for Financial Time Series Analysis: Previous Work; Baum-Welch Re-estimation for Gaussian mixtures; Multivariate Gaussian Mixture as Observation Density Function; Multiple Observation Sequences; Dynamic Training Pool; Making Predictions with HMM; Chapter Summary
4 Exponentially Weighted EM Algorithm for HMM: Maximum-likelihood; The Traditional EM Algorithm; Rewriting the Q Function; Weighted EM Algorithm for Gaussian Mixtures; HMGM Parameter Re-estimation with the Weighted EM Algorithm; Calculation and Properties of the Exponential Moving Average; HMGM Parameter Re-estimation in a Form of EMA; The EM Theorem and Convergence Property; The Double Weighted EM Algorithm; Chapter Summary
5 Experimental Results: Experiments with Synthetic Data Set; Recurrent Neural Network Results; HMGM Experimental Results; Test on Data Sets from Different Time Periods; Comparing the Sharpe Ratio with the Top Five S&P 500 Mutual Funds; Comparisons with Previous Work
6 Conclusion and Future Work: Conclusion; Future Work
A Terminology List
Bibliography

List of Tables

1.1 Data format of the experiment (for the segment from Jan 2, 2002 to Jan 7, 2002); 1.2 Notation list; 2.1 State transition probability matrix of the HMM for stock price forecast; 5.1 Comparison of the recurrent network and the exponentially weighted HMGM on the synthetic data testing; 5.2 Comparison of the Sharpe Ratio between the HMM and the top 5 S&P 500 funds during the period of August 5, 1999 to March 7, 2001; 5.3 Comparison of the Sharpe Ratios during the period of March 8, 2001 to October 11, 2002

List of Figures

1.1 Modeling the possible movement directions of a stock as a Markov chain; 1.2 A Super Five Day Pattern in the NASDAQ Composite Index from 1984 to 1990; 1.3 The return curve of a prediction system; 2.1 The set of strategies of a financial professional modeled as hidden states; 2.2 The trellis of the HMM; 2.3 The forward procedure; 2.4 The backward procedure; 2.5 Finding the best path with the Viterbi algorithm; 2.6 Three single Gaussian density functions; 3.1 A Gaussian mixture derived from the three Gaussian densities; 3.2 Moving average crossovers as bullish/bearish signals; 3.3 The dynamic training pool; 3.4 Hidden Markov Gaussian Mixtures; 4.1 The prediction falls behind the market during highly volatile periods; 4.2 The portion of each of the first 19 days' EMA in calculating the EMA for the 20th day; 4.3 The portion of each of the 20 days' actual data in calculating the EMA for the 20th day; 5.1 Experimental results on computer-generated data; 5.2-5.4 Predictions of the NASDAQ Index in 1998, 1999, and 2000 with a recurrent neural network; 5.5-5.8 Predictions of the 400-day S&P 500 Index starting from Jan 2, 1998, with training window sizes 20, 40, 100, and 10; 5.9 Prediction with HMGM for 400 days of the S&P 500 Index; 5.10 Prediction with multivariate HMGM; 5.11 Prediction with single exponentially weighted HMGM; 5.12 Prediction with double weighted HMGM with a confidence adjustment variable; 5.13 Results from all models for comparison; 5.14 Results on the period from November 2, 1994 to June 3, 1996; 5.15 Results on the period from June 4, 1996 to December 31, 1998; 5.16 Results on the period from August 5, 1999 to March 7, 2001; 5.17 Results on the period from March 8, 2001 to October 11, 2002; 5.18 Too many positive signals result in a curve similar to the index
the limitations of regression models. 1. Similar to traditional ﬁnancial engineering. In this thesis. the average rate of return is the Dow Jones Industrial Index . The organization of the thesis is also outlined in the later part this chapter. This chapter describes the basic concepts in ﬁnancial time series. we apply a machien learning model–the Hidden Markov Model to ﬁnancial time series. Over the past 50 years. In this thesis we will focus our research on the most volatile and challenging ﬁnancial market – the stock market. we build models to analyze and predict ﬁnancial time series. All the ﬁnancial analysts carry out many kinds of research in the hope of accomplishing one goal: to beat the market.Chapter 1 Introduction This thesis explores a novel model in the interdisciplinary area of computational ﬁnance. which is a weighted average of 30 signiﬁcant stocks. and the S&P 1 .1 Financial Time Series Financial time series data are a sequence of prices of some ﬁnancial assets over a speciﬁc period of time. the diﬃculties of accurate ﬁnancial time series prediction. motivation and objective of our machine learning model. What distinguish us from the previous work is rather than building models based on regression equations. our experiment data include the daily data of NASDAQ Composite Index. the Dow Jones Industrial Index and the S&P 500 Index. More speciﬁcally. To beat the market means to have a rate of return that is consistently higher than the average return of the market while keeping the same level of risk as the market. there has been a large amount of intensive research on the stock market.
500 Index, which is a combination of 500 stocks. Usually we have a benchmark series to compare against; in our experiments we use the S&P 500 Index as the benchmark. In this thesis, we use "forecast" and "prediction" interchangeably to represent the process of generating unseen data in a financial time series.

Generally there are two main streams in financial market analysis: fundamental analysis and technical analysis. Fundamental analysis is the examination of the underlying forces that affect the well-being of the economy, industry sectors, and individual companies. To forecast the future prices of an individual stock, fundamental analysis combines economic, industry, and company analyses to derive the stock's current fair value and forecast its future value. If the fair value is not equal to the current stock price, fundamental analysts believe that the stock is either over- or under-valued and that the market price will ultimately gravitate toward the fair value. There are some inherent weaknesses in fundamental analysis. First, it is very time consuming: by the time fundamental analysts collect information about the health of a company or the economy and give their prediction of the future price, the current price has already changed to reflect the most up-to-date information. Second, there is an inevitable analyst bias and accompanying subjectivity.

Technical analysis is the examination of past price movements to forecast future price movements. The theoretical basis of technical analysis is: price reflects all the relevant information; price movements are not totally random; history repeats itself. All of these are derived from the belief in the non-efficiency of the financial markets. We will discuss the Efficient Market Hypothesis and the evidence of non-efficiency in the market later in this chapter. We will focus only on methods of technical analysis in this thesis. In our experiments we base our forecast of the daily stock price on the past daily prices.

1.2 Return of Financial Time Series

One of the most important indicators for measuring the performance of mutual funds or prediction systems is the return of the system. Let Pt be the price of an asset at time index t. The one-period simple net return Rt of a financial time series is calculated with the following formula:
R_t = (P_t − P_{t−1}) / P_{t−1}   (1.1)

For a k-period financial time series, the return at the end of the period is a cumulative return composed of the returns of all the periods:

1 + R[k] = (1 + R_1)(1 + R_2) · · · (1 + R_k) = ∏_{t=1}^{k} (1 + R_t)   (1.2)

In developing a prediction system, we only predict the direction of the time series, generating a signal of either up or down for each trading day; our goal is to make as many accurate predictions as possible. Following the signals of a prediction system results in a return curve for the system. Figure 1.1 shows the return based on a strategy using some prediction system: if we invest 1 dollar at the beginning of the period, we will get $2.1 at the end of the period. Normally we care more about the return than about the accuracy rate of the prediction, because the simple net return on each day is different: two days of accurate predictions of small changes may not offset the loss from one day's incorrect prediction of a large change in the market. Still, the higher the prediction accuracy, the higher the return curve will be above the index curve. If one got an accuracy of 100% for any 400 consecutive days in the ten-year period from 1987 to 1994, the return at the end of the period would be roughly 25 times the initial input. This is the upper bound for the return.

1.3 The Standard & Poor 500 data

Prediction of financial time series is a very extensive topic. The real-world financial time series we input into the HMM prediction system is the daily return of the Standard & Poor 500 Index (S&P 500). The S&P 500 Index is a weighted average of the stock prices of 500 leading companies in leading industry sectors of the US, and it is viewed as one of the best indicators of the health of the US economy. The index reflects the total market value of all component stocks relative to a particular period of time. Unlike the Dow Jones Industrial Average, which is an average of the 30 component stock prices, the S&P 500 represents the total market value of the 500 component stocks.
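The return definitions in Equations 1.1 and 1.2 can be made concrete with a short sketch. The prices and the function names here are ours, for illustration only:

```python
def simple_returns(prices):
    """One-period simple net returns: R_t = (P_t - P_{t-1}) / P_{t-1}."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]

def cumulative_return(returns):
    """k-period cumulative return: 1 + R[k] = prod_t (1 + R_t)."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0

# Hypothetical price series.
prices = [100.0, 101.0, 99.0, 102.0]
rets = simple_returns(prices)

# Compounding the per-period returns recovers the overall price ratio.
assert abs((1.0 + cumulative_return(rets)) - prices[-1] / prices[0]) < 1e-12
```

The final assertion is exactly the identity behind Equation 1.2: compounding the one-period returns gives the total growth of the invested dollar.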
To calculate the S&P 500 Index, we first add up the total market value of all the 500 stocks, then divide it by the total number of outstanding shares in the market. Because it looks at the total market value and includes stocks from different industries than the Dow, the S&P 500 has some characteristics that differ from those of the Dow.

The return of the S&P 500 Index for each day is calculated using the techniques introduced in Section 1.2. The price we use for the calculation is the closing price for each day. An example of the data format is shown in Table 1.1.

In some scenarios of the experiments we also include the Dow Jones Index data that span the same time period as the S&P 500 Index, and in some experiments we use a second sequence of data: the volume. One reason to use another time series to facilitate the prediction is that there are sideways movements in the long-term trends. These movements are the reverse signals in a long-term trend, also called whipsaws (see Chapter 3). We need to identify the real trend-change signals among the noisy whipsaws; the Dow is a less sensitive index than the S&P 500, so we use it to help smooth the prediction.

Figure 1.1: The return curve of a prediction system. A $1 input at the beginning of the trading period generates a return of $2.1, while the index drops below 1.
1.4 Traditional Techniques and Their Limitations

Traditionally, people use various forms of statistical models to predict volatile financial time series [GP86]. Most of these statistical models require one assumption: the form of the population distribution, such as Gaussian distributions for stock markets or binomial distributions for bonds and debts markets [MCG04]. Sometimes stocks move without any apparent news, as the Dow Jones Industrial Index did on Monday, October 19, 1987: it fell by 22.6 percent without any obvious reason. We have to identify whether such an irrational price movement is noise or is part of a pattern.

Before we go on to the machine learning solution to the problem, let us take a short review of the econometrics methods. Classical econometrics has a large literature on the time series forecast problem. The basic idea is as follows. To make things simple, assume the time series follows a linear pattern, y_t = a + b × y_{t−1} + noise, where a and b are regression weights; then work out the weights given some form of noise. This method is rather crude but may work for deterministic linear trends. For nonlinear trends, one needs to use a nonlinear regression such as y_t = a × y_{t−1} + b × y_{t−1}^2, where the nonlinear formulation is very hard to find. The econometrics method usually proposes a statistical model based on a set of parameters. As these models are independent of the past prices, some of these parameters must be estimated based on other current situations, and the process of estimating these parameters brings in human analyst bias.

Perhaps the most prominent of all these models is the Black-Scholes model for stock option pricing [BS73] [Hul00]. The Black-Scholes model is a statistical model for pricing the stock option at a future time based on a few parameters.

Date        Close    Return    Volume
Jan-2-2002  1154.67  0.00574   1418169984
Jan-3-2002  1165.27  0.00918   1874700032
Jan-4-2002  1172.51  0.006213  1943559936
Jan-7-2002  1164.89  -0.0065   1682759936

Table 1.1: Data format of the experiment (for the segment from Jan 2, 2002 to Jan 7, 2002)
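As a sanity check, the daily returns in Table 1.1 can be recomputed from the closing prices with Equation 1.1. The prior close of 1148.08 (Dec 31, 2001), needed for the first day's return, is our assumption and does not appear in the table:

```python
closes = [
    1148.08,  # Dec-31-2001 (assumed prior close, not shown in Table 1.1)
    1154.67,  # Jan-2-2002
    1165.27,  # Jan-3-2002
    1172.51,  # Jan-4-2002
    1164.89,  # Jan-7-2002
]

# R_t = (P_t - P_{t-1}) / P_{t-1}, as in Equation 1.1
returns = [(p1 - p0) / p0 for p0, p1 in zip(closes, closes[1:])]
print([round(r, 6) for r in returns])
# -> [0.00574, 0.00918, 0.006213, -0.006499]
```

The recomputed values match the Return column of Table 1.1, with the last return negative since the index fell from Jan 4 to Jan 7.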
In the Black-Scholes option pricing model, the no-arbitrage¹ option cost is calculated as:

C = C(s, t, K, σ, r)
  = e^{−rt} E[(s·e^W − K)^+]
  = E[(s·e^{W−rt} − K·e^{−rt})^+]
  = E[(s·e^{−σ²t/2 + σ√t·Z} − K·e^{−rt})^+]   (1.3)

where s is the security's initial price, K is the strike price, t is the exercise time of the option, r is the interest rate, and W is a normal random variable with mean (r − σ²/2)t and variance σ²t, so that W − rt = −σ²t/2 + σ√t·Z for a standard normal random variable Z. The first four parameters are fixed and can be easily identified. The volatility σ, however, must be estimated: the only way to decide it is to pick a segment of past price data and assume that the logarithm of the prices follows the normal random distribution. The choice of the price segment and the assumption that the logarithm of the prices follows the normal distribution are rather subjective.

Another inherent problem of the statistical methods is that sometimes there are price movements that cannot be modeled with one single probability density function (pdf). In [Ros03], we see an experiment on crude oil data. The example shows that rather than just one function, there are four distributions that determine the difference between the logarithm of tomorrow's price and the logarithm of today's price. These four pdfs must work together, with each of them providing a contribution at different time points. This can be caused by rapid changes in the data patterns: for different time windows there are different patterns, which thus cannot be described with one stationary model. Finally, there are times when we cannot find the best approximation function at all. It has been agreed that no single model can make accurate predictions all the time.

1.5 Non-efficiency of Financial Markets

In a classic statement on the Efficient Market Hypothesis (EMH), Fama [Fam65] [Fam70] defined an efficient financial market as one in which security prices always fully reflect the available information. The EMH "rules out the possibility of trading systems based only on currently available information that have expected profits or returns in excess of equilibrium expected profit or return" [Fam70]. In plain English, an average investor cannot hope to consistently beat the market.

¹Arbitrage is the simultaneous buying and selling of securities to take advantage of price discrepancies. Arbitrage opportunities usually surface after a takeover offer.
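Returning to the option-pricing formula, the expectation in Equation 1.3 can be approximated by plain Monte Carlo simulation. The sketch below is ours, not part of the Black-Scholes literature; the contract parameters are hypothetical, and a closed-form price is included only to check the estimate:

```python
import math
import random

def black_scholes_call(s, K, t, sigma, r):
    """Closed-form Black-Scholes call price, used for comparison."""
    d1 = (math.log(s / K) + (r + sigma ** 2 / 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    N = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF
    return s * N(d1) - K * math.exp(-r * t) * N(d2)

def mc_call(s, K, t, sigma, r, n=100_000, seed=0):
    """Monte Carlo estimate of C = e^{-rt} E[(s e^W - K)^+],
    with W ~ Normal((r - sigma^2/2) t, sigma^2 t), as in Equation 1.3."""
    rng = random.Random(seed)
    mean, sd = (r - sigma ** 2 / 2) * t, sigma * math.sqrt(t)
    payoff = sum(max(s * math.exp(rng.gauss(mean, sd)) - K, 0.0) for _ in range(n))
    return math.exp(-r * t) * payoff / n

# Hypothetical contract: at-the-money call, one year to expiry.
estimate = mc_call(100.0, 100.0, 1.0, 0.2, 0.05)
exact = black_scholes_call(100.0, 100.0, 1.0, 0.2, 0.05)
```

With 100,000 samples the simulated price lands within a few cents of the closed form, which makes concrete why the estimate of σ, the only non-observable input, dominates the model's subjectivity.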
The best prediction for a future price is then the current price; this is called the random walk of the market. If equal performance is obtained on random walk data as compared to the real data, then no significant predictability has been found.

The EMH was met with harsh challenges, both theoretical and empirical, shortly after it was proposed. De Bondt and Thaler [BWT85] compare the performance of two groups of companies, extreme losers and extreme winners. They study the stock prices of the two groups of companies from 1933 to 1985, and find that the portfolios of the winners chosen in the previous three years have relatively lower returns than the losers' portfolios over the following five years. A more prominent example is the so-called January Effect [Wac42] [Tha87]: for small companies, there are superior returns around the end of the year and in early January compared with other times of the year. This phenomenon was still observed many years after Wachtel's initial observation and discussion. Interestingly, the January Effect has disappeared in the last 15 years. But this does not mean the market is fully efficient, as the process took a long time, and there may have evolved other new patterns that are yet to be found out.

Our own program also finds some interesting patterns. There is a super five day pattern in the NASDAQ Composite Index from 1984 to 1990: the first three days and the last two days of each month generate a higher return in the index than any other day of the month. The shape of the return curve thus looks like a parabola. After 1990, the pattern became more random and less recognizable.

Figure 1.2: A super five day pattern in the NASDAQ Composite Index from 1984 to 1990. The first three days and the last two days tend to have higher returns than the rest of the trading days. However, the pattern diminished in the early 1990s.

1.6 Objective

Financial time series consist of multidimensional and complex nonlinear data that result in many pitfalls and difficulties for accurate prediction. We believe that a successful financial time series prediction system should be hybrid and custom made, which means it requires a perfect combination of sophistication in the prediction system design and insight in the input data preprocessing. Both areas have been extensively researched. Data mining researchers have developed some very useful techniques for dealing with data, such as handling missing values [HK01] and data multiplication [WG93]; these techniques derive enhanced features of interest from the raw time series data. Traders and technical analysts have also done a lot of work, such as the variety of indicators including Moving Average Convergence Divergence (MACD), Price by Volume, and
Relative Strength Index (RSI), some of which have been confirmed to contain useful information [STW99]. On the other hand, there have been several straightforward approaches that apply a single algorithm to relatively unprocessed data, such as the Back Propagation Network [CWL96], the Genetic Algorithm [Deb94], Nearest Neighbor [Mit97], Hidden Markov Models (HMMs) [SW97] and so on.

There are three major difficulties with accurate forecasting of financial time series. First, the patterns of well-known financial time series are dynamic: there is no single fixed model that works all the time. Second, it is hard to balance between the long-term trend and short-term sideways movements during the training process; in other words, an efficient system must be able to adjust its sensitivity as time goes by. Third, it is usually hard to determine the usefulness of information: misleading information must be identified and eliminated.

This thesis aims at solving all three problems with a Hidden Markov Model (HMM) based forecasting system. The HMM has a Gaussian mixture at each state as the forecast generator. Gaussian mixtures have desirable features in that they can model series that do not fall into a single Gaussian distribution. The parameters of the HMM are updated at each iteration with an add-drop Expectation-Maximization (EM) algorithm: at each timing point, the parameters of the Gaussian mixtures, the weight of each Gaussian component in the output, and the transformation matrix are all updated dynamically to cater to the non-stationary financial time series data.

We then develop a novel exponentially weighted EM algorithm that solves the second and third difficulties at one stroke. The revised EM algorithm gives more weight to the most recent data, while at the same time adjusting the weight according to changes in volatility. The volatility is calculated as the divergence of the S&P 500 Index from the Dow Jones Composite Index, and it is included as new information that helps make better judgments. The balance between the long-term and short-term series is then found throughout the combined training-forecast process.

1.7 Contributions of the Thesis

This thesis is an interdisciplinary work that combines computer science and financial engineering. Techniques from the two areas are interwoven. The major contributions are:
1. This is one of the first large-scale applications of Hidden Markov Models with Gaussian mixtures to financial time series; there have been few applications of such models to time-dependent sequence data.

2. We develop a novel dynamic training scheme. The concept of the dynamically updated training pool enables the system to absorb the newest information from the data.

3. We develop an exponentially weighted EM algorithm. This algorithm realizes that recent data should be given more attention, and it enables us to recognize trend changes faster than the traditional EM algorithm, especially when the training data set is relatively large.

4. We show that the weighted EM algorithms can be written in the form of exponential moving averages of the model variables of the HMM. This allows us to take advantage of this existing econometrics technique in generating the EM-based re-estimation equations.

5. Based on the exponentially weighted EM algorithm, we propose a double weighted EM algorithm. The double weighted EM algorithm solves the drawback of the single exponentially weighted EM algorithm caused by over-sensitivity.

6. Using techniques from the EM Theorem, we prove that the convergence of the two proposed algorithms is guaranteed.

7. Experimental results show that our models consistently outperform the index fund in returns over five 400-day testing periods from 1994 to 2002.

8. The proposed model also consistently outperforms top-listed S&P 500 mutual funds on the Sharpe Ratio.

1.8 Thesis Organization

Chapter 2 of this thesis introduces the general form of HMMs. Chapter 3 describes an HMM with Gaussian mixtures for the observation probability density functions and derives the parameter re-estimation formulas based on the classic EM algorithm. In chapter 4, some detailed aspects of the HMM are studied, including multiple observation sequences and multivariate Gaussian mixtures as pdfs; we also derive our revised exponentially weighted EM algorithm and the new parameter updating rules in that chapter. Then in chapter 5 we present the experimental results. Chapter 6 concludes the thesis; future work that indicates possible improvements of the model is also included in chapter 6.

1.9 Notation

We list the Roman and Greek letters used in this thesis for reference. These symbols are used in the derivations of the re-estimation formulas and the proofs in chapters 2, 3 and 4.
Symbol      Notation
s_t         The state at time t
π_i         Probability of being at state i at time 0
a_ij        Probability of moving from state i to state j
b_j(o_t)    Probability of being at state j and observing o_t
w_jk        Mixture weight for the kth component at state j
O           The observation sequence
o_t         The tth observation
α_i(t)      Forward variable in HMM
β_i(t)      Backward variable in HMM
p_t(i, j)   Prob. of being at state i at time t, and state j at t + 1
γ_i(t)      Prob. of being at state i at time t
δ_i(t)      Prob. of the best sequence ending in state i at time t
δ_im(t)     Prob. of the best sequence ending at the mth component of the ith state
ψ_t         The node of the incoming arc that leads to the best path
X           Observed data
Y           Hidden part governed by a stochastic process
Z           Complete data set
η           Exponential weight imposed on intermediate variables in EM
ζ           Self-adjustable weight in double weighted EM
ρ           The multiplier in calculating EMA
δ_im(T)     The EMA of δ_im at time T
δ_i(T)      The EMA of δ_i at time T
θ_j         Parameters of the jth component in a Gaussian mixture
Θ           Parameters of a Gaussian mixture
L           The likelihood function

Table 1.2: Notation list
Chapter 2

Hidden Markov Models

This chapter describes the general Markov process and Hidden Markov Models (HMMs). We start by introducing the properties of the Markov process, then the characteristics of HMMs and the main problems involved with them. [RJ86] and [Rab89] provide a thorough introduction to the Markov process and Hidden Markov Models. These discussions form the basis of our modeling and experiments with financial time series data.

2.1 Markov Models

Markov models are used to train and recognize sequential data, such as speech utterances, temperature variations, biological sequences, and other sequence data. In a Markov model, each observation in the data sequence depends on previous elements in the sequence. Consider a system with a set of distinct states, S = {1, 2, ..., N}. At each discrete time slot t, the system takes a move to one of the states according to a set of state transition probabilities P. We denote the state at time t as s_t.

In many cases, the prediction of the next state and its associated observation only depends on the current state, meaning that the state transition probabilities do not depend on the whole history of the past process. This is called a first order Markov process. For example, knowing the number of students admitted this year might be adequate to predict next year's admission; there is no need to look into the admission rate of the previous years. These are the Markov properties described in [MS99]:

P(X_{t+1} = s_k | X_1, ..., X_t) = P(X_{t+1} = s_k | X_t)   (Limited horizon)   (2.1)
P(X_{t+1} = s_k | X_t) = P(X_2 = s_k | X_1)   (Time invariant)   (2.2)

Because the state transition is independent of time, we can define the following state transition matrix A:

a_ij = P(X_{t+1} = s_j | X_t = s_i)   (2.3)

a_ij is a probability, hence:

a_ij ≥ 0, ∀i, j;   Σ_{j=1}^{N} a_ij = 1, ∀i   (2.4)

Also we need to know the probability of starting from a certain state, the initial state distribution:

π_i = P(X_1 = s_i),  where Σ_{i=1}^{N} π_i = 1   (2.5)

In a visible Markov model, the states from which the observations are produced and the probabilistic functions are known, so we can regard the state sequence as the output.

2.2 A Markov chain of stock market movement

In this section we use a Markov chain to represent stock movement. We present a five-state model of the movement of a certain stock (see Figure 2.1). In the figure, nodes 1, 2, 3, 4, 5 represent a possible movement vector {large rise, small rise, no change, small drop, large drop}.

From the model in the figure we are able to answer some interesting questions about how this stock behaves over time. For example: What is the probability of seeing "small drop, no change, no change, small rise, no change, large rise" in six consecutive days? First we can define the state sequence as X = {1, 2, 3, 2, 2, 4}. Given the Markov chain, the probability of seeing the above sequence can be calculated as:

P(X | A, π) = P(1, 2, 3, 2, 2, 4 | A, π)   (2.6)
            = P(1) P(2|1) P(3|2) P(2|3) P(2|2) P(4|2)   (2.7)
            = π_1 · a_12 · a_23 · a_32 · a_22 · a_24   (2.8)
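The product form of Equation 2.8 can be checked with a short sketch. Since the exact arc labels of Figure 2.1 are hard to read in this copy, the transition matrix and initial distribution below are hypothetical stand-ins:

```python
# Hypothetical 5-state transition matrix (each row sums to 1); the values are
# illustrative, not taken from Figure 2.1.
A = [
    [0.4, 0.2, 0.1, 0.2, 0.1],
    [0.2, 0.4, 0.2, 0.1, 0.1],
    [0.1, 0.2, 0.4, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.4, 0.2],
    [0.1, 0.2, 0.2, 0.1, 0.4],
]
pi = [0.2, 0.2, 0.2, 0.2, 0.2]  # assumed uniform initial distribution

def chain_probability(states, A, pi):
    """P(X | A, pi) = pi[x_1] * prod_t A[x_t][x_{t+1}] (0-based states)."""
    p = pi[states[0]]
    for s, s_next in zip(states, states[1:]):
        p *= A[s][s_next]
    return p

# The six-day sequence X = {1, 2, 3, 2, 2, 4} from the text, written 0-based.
X = [0, 1, 2, 1, 1, 3]
p = chain_probability(X, A, pi)
```

With these made-up numbers the sequence probability is 0.2 × 0.2 × 0.2 × 0.2 × 0.4 × 0.1 = 6.4e-05; the point is only the chain-rule factorization, which is what Equation 2.8 states.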
Figure 2.1: Modeling the possible movement directions of a stock as a Markov chain. Each of the nodes in the graph represents a move (lr: large rise, sr: small rise, uc: no change, sd: small drop, ld: large drop), and each arc is labeled with a transition probability. As time goes by, the model moves from one state to another, just as the stock price fluctuates every day.

More generally, the probability of a state sequence X_1, ..., X_T can be calculated as the product of the transition probabilities:

P(X | A, π) = P(X_1) P(X_2 | X_1) P(X_3 | X_1, X_2) · · · P(X_T | X_1, ..., X_{T−1})   (2.9)
            = P(X_1) P(X_2 | X_1) P(X_3 | X_2) · · · P(X_T | X_{T−1})   (2.10)
            = π_{X_1} ∏_{t=1}^{T−1} a_{X_t X_{t+1}}   (2.11)

2.3 Hidden Markov Models

The Markov model described in the previous section has limited power in many applications. Therefore we extend it to a model with greater representation power, the Hidden Markov Model (HMM). In an HMM, one does not know anything about what generates the observation sequence: the number of states, the transition probabilities, and from which state an observation is generated are all unknown. Instead of associating each state with a deterministic output (such as large rise, small rise, etc.), each state of the HMM is associated with a probabilistic function. At time t, an observation o_t is generated by a probabilistic function b_j(o_t), which is associated with state j, with the probability:
b_j(o_t) = P(o_t | X_t = j)   (2.12)

Figure 2.2 shows an example of an HMM that models an investment professional's prediction about the market based on a set of investment strategies. To make the example intuitive, think of the HMM as a financial expert. Assume the professional has five investment strategies, each of them giving a distinctive prediction about the market movement (large drop, small drop, no change, small rise, or large rise) on a certain day. Also assume he performs market prediction every day with only one strategy. Each of the five strategies is a state in the HMM; each state is connected to all the other states with a transition probability distribution, and each strategy has a different probability distribution for all the movement symbols. The movement set has all the movement symbols, and each symbol is associated with one state.

Now how does the professional make his prediction? What is the probability of seeing a large rise on the first day? What is the probability of seeing the prediction sequence {small drop, small rise} if the prediction starts by taking strategy 2? See Figure 2.3 and Table 2.1.

            LR    SR    UC    SD    LD
Strategy 1  10%   40%   20%   15%   5%
Strategy 2  15%   30%   30%   15%   10%
Strategy 3  5%    5%    20%   40%   30%
Strategy 4  40%   30%   20%   5%    5%
Strategy 5  20%   20%   20%   20%   20%

Figure 2.2: The figure shows the set of strategies that a financial professional can take. The strategies are modeled as states and are connected to each other. Each strategy is associated with a probability distribution over the possible stock price movements; each state produces an observation picked from the vector [LR, SR, UC, SD, LD]. These probabilities are hidden to the public; only the professional himself knows them.

Figure 2.3: The trellis of the HMM (Strategy 1 through Strategy 5, unrolled over Day 1, Day 2, Day 3). The states are associated with different strategies and are connected to each other. Unlike the Markov chain of the previous section, at each step the HMM picks a particular strategy to follow based not only on the strategy that was just taken but also on the observation probability. The key to the HMM performance is that it can pick the best overall sequence of strategies based on an observation sequence. Introducing density functions at the states of the HMM gives more representation power than fixed strategies associated with the states.

The steps for generating the prediction sequence are:
1. Choose an initial state (i.e., a strategy) X_1 = i according to the initial state distribution Π.
2. Set the time step t = 1, since we are at the first day of the sequence.
3. Get the prediction according to the strategy of the current state i, i.e., generate the observation according to b_i(o_t). This gives us the prediction for the current day.
4. Transit to a new state X_{t+1} according to the state transition probability distribution.
5. Set t = t + 1; if t ≤ T, go to step 3, otherwise terminate.

To answer the first question: the probability of seeing a large rise prediction on the first day if we take strategy 2 is 15%. To answer the second question, we sum up, over all the state sequences through which the movement sequence may happen, the probability of generating the sequence. We know the prediction starts with strategy 2, which means the initial state is state 2; then there are five possible states for the second day. So the probability of seeing the movement sequence {small drop, small rise} is:

0.15 × (0.1 × 0.4 + 0.2 × 0.3 + 0.3 × 0.05 + 0.2 × 0.3 + 0.2 × 0.2) = 0.15 × 0.215 ≈ 0.032

where 0.15 is the probability of strategy 2 predicting a small drop on the first day, and each term in the parentheses is the probability of transiting to one of the five strategies times the probability of that strategy predicting a small rise.

            Strategy 1  Strategy 2  Strategy 3  Strategy 4  Strategy 5
Strategy 1  10%         10%         20%         30%         30%
Strategy 2  10%         20%         30%         20%         20%
Strategy 3  20%         30%         10%         20%         20%
Strategy 4  30%         20%         20%         15%         15%
Strategy 5  30%         20%         20%         15%         15%

Table 2.1: State transition probability matrix of the HMM for stock price forecast.

2.4 Why HMM?

From the previous example, we can see that there is a good match between sequential data analysis and the HMM: the data can be thought of as being generated by some underlying stochastic process that is not known to the public. This is especially true in the case of financial time series analysis and prediction. Just as in the above example, where the professional's strategies and the number of his strategies are unknown and the professional might change his strategy without letting others know, what drives the financial market and what patterns financial time series follow have long been interests that attract economists, mathematicians, and most recently computer scientists. As shown in the example, the HMM has more representative power than the Markov chain: the transition probabilities as well as the observation-generation probability density function are both adjustable, and the model can handle variable lengths of input sequences.

Besides the match between the properties of the HMM and the properties of time series data, the Expectation-Maximization (EM) algorithm provides an efficient class of training methods [BP66]. Given plenty of data that are generated by some hidden power, we can create an HMM architecture, and the EM algorithm allows us to find the best model parameters that account for the observed training data. The general HMM approach is an unsupervised learning framework which allows new patterns to be found without having to impose a "template" to learn.
HMMs have been applied to many areas: they are widely used in speech recognition [Rab89] [Cha93] and bioinformatics [PBM94], and in different kinds of time series analysis, such as weather data and semiconductor malfunction [GS00] [HG94]; see also [ADB98]. Weigend and Shi [SW97] have previously applied HMMs to financial time series analysis; we compare our approach to theirs in Chapter 5. We show a revised EM algorithm in later chapters that gives more weight to recent training data while trying to balance the conflicting requirements of sensitivity and accuracy.

2.5 General form of an HMM

An HMM is composed of a five-tuple (S, K, A, Π, B):

1. S = {1, 2, ..., N} is the set of states. The state at time t is denoted s_t.
2. K = {k_1, ..., k_M} is the output alphabet. In a discrete observation density case, M is the number of observation choices; in the previous example, M equals the number of possible movements.
3. A = {a_ij} is the state transition probability distribution, where a_ij = P(s_{t+1} = j | s_t = i), i, j ∈ S.   (2.13)
4. Π = {π_i} is the initial state distribution, where π_i = P(s_1 = i), i ∈ S.   (2.14)
5. B = {b_j(o_t)} is the observation symbol probability distribution, where the probabilistic function for each state j is b_j(o_t) = P(o_t | s_t = j).   (2.15)

In a discrete observation density case, the observation symbol probability density function takes the form b_j(o_t) = P(o_t = k | s_t = j), 1 ≤ k ≤ M.   (2.16)

After modeling a problem as an HMM, and assuming that some set of data was generated by the HMM, we are able to calculate the probabilities of the observation sequence and of the probable underlying state sequences. We can also train the model parameters based on the observed data, getting a more accurate model, and then use the trained model to predict unseen data.

In financial time series analysis, we sometimes need to partition the probability density function of the observations into a discrete set of symbols v_1, v_2, ..., v_M. For example, we might want to model the direction of the price movement as a five-element vector {large drop, small drop, no change, small rise, large rise}. This partitioning process is called vector quantization; [Fur01] provides a survey of some of the most commonly used vector quantization algorithms. After the vector quantization, a codebook is created, and the following analysis and prediction are within the domain of the codebook.

2.6 Discrete and continuous observation probability distribution

2.6.1 Discrete observation probability distribution

There are a fixed number of possible observations in the example in section 2.3, and the function at each state is called a discrete probability density function.
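A five-tuple like the one above can be instantiated directly. The following sketch samples an observation sequence from a small discrete HMM; all numbers are made up for illustration and come from neither the example nor the thesis's experiments:

```python
import random

# A small discrete HMM: two states, three output symbols, hypothetical numbers.
states = [0, 1]                      # S
symbols = ["up", "flat", "down"]     # K
pi = [0.6, 0.4]                      # Pi: initial state distribution
A = [[0.7, 0.3],                     # A: state transition probabilities
     [0.4, 0.6]]
B = [[0.5, 0.3, 0.2],                # B: symbol probabilities per state
     [0.1, 0.3, 0.6]]

def sample_sequence(T, seed=0):
    """Generate T observations by walking the hidden state chain."""
    rng = random.Random(seed)
    s = rng.choices(states, weights=pi)[0]
    out = []
    for _ in range(T):
        out.append(rng.choices(symbols, weights=B[s])[0])  # emit from state s
        s = rng.choices(states, weights=A[s])[0]           # then transition
    return out

obs = sample_sequence(10)
```

This is exactly the generation procedure of section 2.3's steps 1-5, written as code: choose a start state from Π, emit from B, move with A, repeat.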
Using discrete probability distribution functions greatly simplifies the modeling of the problem: the codebook is used to estimate the probability of each possible symbol, and the most probable one is produced in the prediction step. However, discrete pdfs have limited representation power. To solve this problem, continuous probability distribution functions are used during the training phase, enabling us to get as accurate a model as possible.

2.6.2 Continuous observation probability distribution

In a continuous observation probability HMM, the function b_j(o_t) takes the form of a continuous probability density function (pdf) or a mixture of continuous pdfs:

b_j(o_t) = Σ_{k=1}^{M} w_jk b_jk(o_t),  j = 1, ..., N   (2.17)

where M is the number of mixture components and w_jk is the weight of each component. The mixture weights satisfy the constraints:

Σ_{k=1}^{M} w_jk = 1, j = 1, ..., N;   w_jk ≥ 0, j = 1, ..., N, k = 1, ..., M   (2.18)

Each b_jk(o_t) is a D-dimensional log-concave or elliptically symmetric density with mean vector μ_jk and covariance matrix Σ_jk:

b_jk(o_t) = N(o_t, μ_jk, Σ_jk)   (2.19)

The most used D-dimensional log-concave or elliptically symmetric density is the Gaussian density, or Normal distribution. The task for training is then to learn the parameters of these pdfs. Another problem is the selection of the pdf for each mixture; for details on mixtures of Gaussian distributions, please refer to [Mit97] and [MS99]. We will describe more about this subject in Chapter 3.

Training with a D-dimensional multivariate time series requires that we pick the most correlated time series as input; the dimension of the multivariate time series cannot be too large, as it increases the computation complexity by a quadratic factor. For example, in order to predict the price of IBM, we need to find the factors most correlated with our prediction: it makes more sense to include the prices of Microsoft, Sun and other high-tech stocks rather than include Exxon or McDonald's, whose behavior has less correlation with that of IBM. An automatic method for dimension reduction is Principal Component Analysis (PCA).
The single Gaussian density function has the form:

   bjk(ot) = N(ot, µjk, σjk) = (1/√(2πσ²)) exp(−(ot − µ)² / (2σ²))    (2.20)

where ot is a real value. The multivariate Gaussian density is of the form:

   bjk(ot) = N(ot, µjk, Σjk) = (1 / ((2π)^{D/2} |Σjk|^{1/2})) exp(−(1/2)(ot − µjk)ᵀ Σjk⁻¹ (ot − µjk))    (2.21)

where ot is a vector.

2.7 Three Fundamental Problems for HMMs

There are three fundamental problems about the HMM that are of particular interest to us:

1. Given a model µ = (A, B, Π) and an observation sequence O = (o1, ..., oT), how do we efficiently compute the probability of the observation sequence given the model, i.e. P(O|µ)?

2. Given a model µ and the observation sequence O, what is the underlying state sequence (s1, ..., sT) that best "explains" the observations?

3. Given an observation sequence O and a space of possible models, how do we adjust the parameters so as to find the model µ that maximizes P(O|µ)?

The first problem can be used to determine which of the trained models is most likely when a training observation sequence is given. In the financial time series case: given the past history of price or index movements, which model best accounts for that movement history? The second problem is about finding the hidden path: in our previous example in section 2.3, what is the sequence of strategies the financial professional applied on each trading day? The third problem is of the most significant interest because it deals with training the model. In the real-life financial time series case, it amounts to training the model on past historical data; only after training can we use the model to perform further predictions.
2.7.1 Problem 1: finding the probability of an observation

Given an observation sequence O = (o1, ..., oT) and an HMM µ = (A, B, Π), we want to find the probability of the sequence, P(O|µ). This process is also known as decoding. Since the observations at each time t are conditionally independent given the states, the probability of a state sequence S = (s1, ..., sT+1) generating the observation sequence can be calculated as:

   P(O|S, µ) = Π_{t=1}^{T} P(ot | st, st+1, µ)    (2.22)
             = b_{s1 s2}(o1) · b_{s2 s3}(o2) ··· b_{sT sT+1}(oT)    (2.23)

and the state sequence probability is:

   P(S|µ) = πs1 · a_{s1 s2} · a_{s2 s3} ··· a_{sT sT+1}    (2.24)

The joint probability of O and S is:

   P(O, S|µ) = P(O|S, µ) P(S|µ)    (2.25)

Therefore:

   P(O|µ) = Σ_S P(O|S, µ) P(S|µ)    (2.26)
          = Σ_{s1···sT+1} πs1 Π_{t=1}^{T} a_{st st+1} b_{st st+1}(ot)    (2.27)

The computation is quite straightforward: sum the observation probabilities over each of the possible state sequences. A major problem with this approach is that enumerating every state sequence is inefficient. As the length T of the sequence grows, the computation grows exponentially, requiring (2T − 1) · N^{T+1} multiplications and N^T − 1 additions. To solve this problem, an efficient algorithm — the forward-backward algorithm — was developed, which cuts the computational complexity to linear in T.

The forward procedure

The forward variable αi(t) is defined as:

   αi(t) = P(o1 o2 ··· ot−1, st = i | µ)    (2.28)
αi(t) stores the total probability of ending up in state si at time t, given the observation sequence o1 ··· ot−1. The forward variable at each time t can be calculated inductively (see Figure 2.4): it is obtained by summing the probabilities over all incoming arcs at each trellis node.

[Figure 2.4: The forward procedure — a trellis of states s1, ..., sn over time steps t − 1, t, t + 1.]

1. Initialization:

   αi(1) = πi,  1 ≤ i ≤ N    (2.29)

2. Induction:

   αj(t + 1) = Σ_{i=1}^{N} αi(t) aij b_{ij ot},  1 ≤ t ≤ T,  1 ≤ j ≤ N    (2.30)
3. Update time: set t = t + 1. Return to step 2 if t < T; otherwise terminate the algorithm.

4. Termination:

   P(O|µ) = Σ_{i=1}^{N} αi(T)    (2.31)

The forward algorithm requires N(N + 1)(T − 1) + N multiplications and N(N − 1)(T − 1) additions. The complexity is O(T N²).

The backward procedure

The induction computation of the forward procedure can also be performed in the reverse order. The backward procedure calculates the probability of the partial observation sequence from t + 1 to the end, given the model µ and state si at time t. The backward variable βi(t) is defined as:

   βi(t) = P(ot+1 ot+2 ··· oT | st = i, µ)    (2.32)

The backward algorithm works from right to left through the same trellis:

1. Initialization:

   βi(T) = 1,  1 ≤ i ≤ N    (2.33)

2. Induction:

   βi(t) = Σ_{j=1}^{N} aij b_{ij ot} βj(t + 1),  1 ≤ t ≤ T,  1 ≤ i ≤ N    (2.34)

3. Update time: set t = t − 1. Return to step 2 if t > 0; otherwise terminate the algorithm.

4. Total:

   P(O|µ) = Σ_{i=1}^{N} πi βi(1)    (2.35)
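The two procedures above can be sketched in Python. This is an illustrative implementation, not the thesis's own code: it uses the state-emission convention bj(ot) (the first-order form the chapter later adopts) with discrete observation symbols, and the standard indexing in which alpha[t][i] accumulates observations up to and including time t.

```python
def forward(pi, A, B, obs):
    """Forward procedure: alpha[t][j] = P(o_1..o_t, s_t = j | model).

    pi[i]: initial state probabilities, A[i][j]: transition probabilities,
    B[j][k]: probability of emitting symbol k in state j.
    Returns (P(O|model), alpha)."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):                               # initialization
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):                            # induction: sum over incoming arcs
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j] for i in range(N))
    return sum(alpha[T - 1][i] for i in range(N)), alpha   # termination

def backward(pi, A, B, obs):
    """Backward procedure: beta[t][i] = P(o_{t+1}..o_T | s_t = i, model)."""
    N, T = len(pi), len(obs)
    beta = [[0.0] * N for _ in range(T)]
    for i in range(N):                               # initialization at the last step
        beta[T - 1][i] = 1.0
    for t in range(T - 2, -1, -1):                   # induction, right to left
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    return beta
```

Both passes touch each of the N² arcs once per time step, which is the O(TN²) cost quoted above; the totals P(O|µ) = Σi αi(T) and P(O|µ) = Σi πi bi(o1) βi(1) must agree.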
[Figure 2.5: The backward procedure — the same trellis of states s1, ..., sn over time steps t − 1, t, t + 1, traversed from right to left.]
2.7.2 Problem 2: Finding the best state sequence

The second problem is to find the best state sequence given a model and the observation sequence. This is a problem often encountered in parsing speech and in pattern recognition. There are several possible ways to solve it; the difficulty is that there may be several possible optimality criteria. One of them is to choose the states that are individually most likely at each time t. For each time t, 1 ≤ t ≤ T + 1, we find the following probability variable:

   γi(t) = P(st = i | O, µ)    (2.36)
         = P(st = i, O | µ) / P(O|µ)    (2.37)
         = αi(t) βi(t) / Σ_{j=1}^{N} αj(t) βj(t)    (2.38)

The individually most likely state sequence S can then be found as:

   S = argmax_{1≤i≤N} γi(t),  1 ≤ t ≤ T + 1    (2.39)

This quantity maximizes the expected number of correct states. However, the approach may generate an unlikely state sequence, because it does not take the state transition probabilities into consideration. For example, if at some point we have a zero transition probability aij = 0, the "optimal" state sequence found may in fact be invalid. Therefore a more efficient algorithm, the Viterbi algorithm, based on dynamic programming, is used to find the best state sequence.

The Viterbi algorithm

The Viterbi algorithm is designed to find the most likely state sequence S = (s1, s2, ..., sT) given the observation sequence O = (o1, o2, ..., oT):

   argmax_S P(S | O, µ)

For a fixed observation sequence O, it is sufficient to maximize:

   argmax_S P(S, O | µ)

We define the following variable:

   δj(t) = max_{s1···st−1} P(s1 ··· st−1, o1 ··· ot−1, st = j | µ)    (2.40)
This variable stores the probability of observing o1 o2 ··· ot−1 along the most likely path that ends in state i at time t, given the model µ. The corresponding variable ψj(t) stores the node of the incoming arc that leads to this most probable path. The calculation is done by induction, similar to the forward-backward algorithm, except that where the forward-backward algorithm sums over previous states, the Viterbi algorithm maximizes. Fig 2.6 shows how the Viterbi algorithm finds a single best path. The complete Viterbi algorithm is as follows:

1. Initialization:

   δi(1) = πi bi(o1),  1 ≤ i ≤ N    (2.41)
   ψi(1) = 0,  1 ≤ i ≤ N    (2.42)

2. Induction:

   δj(t) = b_{ij ot} max_{1≤i≤N} [δi(t − 1) aij]    (2.43)
   ψj(t) = argmax_{1≤i≤N} [δi(t − 1) aij]    (2.44)

3. Update time: set t = t + 1. Return to step 2 if t ≤ T, else terminate the algorithm.

4. Termination:

   P* = max_{1≤i≤N} [δi(T)]    (2.45)
   s*T = argmax_{1≤i≤N} [δi(T)]    (2.46)

5. Path readout:

   s*t = ψt+1(s*t+1)    (2.47)

When the Viterbi algorithm generates multiple equally good paths, we can pick one at random or keep the best n paths.
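The five steps above can be sketched as follows — again an illustrative implementation assuming the state-emission form bj(ot) with discrete symbols, storing the back-pointers ψ for the path readout.

```python
def viterbi(pi, A, B, obs):
    """Most likely state path: like the forward pass, but maximizing
    over predecessors instead of summing, with back-pointers psi."""
    N, T = len(pi), len(obs)
    delta = [[0.0] * N for _ in range(T)]
    psi = [[0] * N for _ in range(T)]
    for i in range(N):                               # initialization
        delta[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):                            # induction
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            psi[t][j] = best_i                       # incoming arc of the best path
            delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
    last = max(range(N), key=lambda i: delta[T - 1][i])   # termination
    path = [last]
    for t in range(T - 1, 0, -1):                    # path readout via psi
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, delta[T - 1][last]
```

In practice the products of probabilities underflow for long sequences, so implementations usually run this recursion on log-probabilities, which turns the products into sums without changing the argmax.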
[Figure 2.6: Finding the best path with the Viterbi algorithm.]

2.7.3 Problem 3: Parameter estimation

The last and most difficult problem about HMMs is that of parameter estimation. Given an observation sequence, we want to find the model parameters µ = (A, B, π) that best explain the observation sequence. The problem can be reformulated as finding the parameters that maximize the following probability:

   argmax_µ P(O|µ)

There is no known analytic method for choosing µ to maximize P(O|µ), but we can use a local maximization algorithm to find the highest probability attainable. This algorithm is called the Baum-Welch algorithm; it is a special case of the Expectation-Maximization (EM) method. It works iteratively to improve the likelihood P(O|µ), and this iterative process is called the training of the model. The Baum-Welch algorithm is numerically stable, with the likelihood non-decreasing at each iteration, and it converges linearly to a local optimum.

To work out the optimal model µ = (A, B, π) iteratively, we need to define a few intermediate variables. Define pt(i, j), 1 ≤ t ≤ T, 1 ≤ i, j ≤ N, as follows:
   pt(i, j) = P(st = i, st+1 = j | O, µ)    (2.49)
            = P(st = i, st+1 = j, O | µ) / P(O|µ)    (2.50)
            = αi(t) aij b_{ij ot} βj(t + 1) / Σ_{m=1}^{N} αm(t) βm(t)    (2.51)
            = αi(t) aij b_{ij ot} βj(t + 1) / Σ_{m=1}^{N} Σ_{n=1}^{N} αm(t) amn b_{mn ot} βn(t + 1)    (2.52)

This is the probability of being in state i at time t and in state j at time t + 1, given the model µ and the observation O. Then define γi(t), the probability of being in state i at time t, given the observation O and the model µ:

   γi(t) = P(st = i | O, µ)    (2.53)
         = Σ_{j=1}^{N} P(st = i, st+1 = j | O, µ)    (2.54)
         = Σ_{j=1}^{N} pt(i, j)    (2.55)

The last equality holds because, summed over time, γi(t) gives the expected number of transitions out of state i, while pt(i, j) gives the expected number of transitions from state i to state j.

Given the above definitions, we begin with an initial model µ and run the training data O through the current model to estimate the expectations of each model parameter. Then we change the model so as to maximize the values of the paths that are used. By repeating this process we hope to converge on the optimal values of the model parameters.
The re-estimation formulas of the model are:

   π̄i = probability of being in state i at time t = 1 = γi(1)    (2.56)

   āij = (expected number of transitions from state i to j) / (expected number of transitions from state i)
       = Σ_{t=1}^{T} pt(i, j) / Σ_{t=1}^{T} γi(t)    (2.57)

   b̄ijk = (expected number of transitions from i to j with k observed) / (expected number of transitions from i to j)
        = Σ_{t: ot = k, 1≤t≤T} pt(i, j) / Σ_{t=1}^{T} pt(i, j)    (2.58)

In the following calculations we will replace b_{ij ot} with b_{j ot}, i.e. each observation is generated by the probability density function associated with one state at one time step. This is also called a first-order HMM.

2.8 Chapter Summary

In this chapter, the general form of the HMM was introduced. The conditions that HMMs must observe, i.e. the Markov properties, were also described. We then studied the three basic problems involved with any HMM, both theoretically and with examples. The model designs and experiments that follow are all extensions of this basic model and of the techniques used to solve the three problems. In fact, in the prediction task, the Viterbi algorithm and the Baum-Welch algorithm are used in combination to make the system work.
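One full Baum-Welch cycle — forward/backward pass, the expectations pt(i, j) and γi(t), then the re-estimation formulas — can be sketched as a single EM iteration. This is an illustrative sketch assuming the first-order (state-emission) form bj(ot) with discrete symbols; pt(i, j) appears under its usual EM name xi.

```python
def baum_welch_step(pi, A, B, obs):
    """One EM iteration: E-step computes xi (= p_t(i,j)) and gamma,
    M-step applies the re-estimation formulas for pi, A and B."""
    N, T, M = len(pi), len(obs), len(B[0])
    # Forward pass
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j] for i in range(N))
    # Backward pass
    beta = [[1.0] * N if t == T - 1 else [0.0] * N for t in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    pO = sum(alpha[T - 1][i] for i in range(N))
    # E-step: state and transition posteriors
    gamma = [[alpha[t][i] * beta[t][i] / pO for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / pO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # M-step: the re-estimation formulas
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_pi, new_A, new_B
```

Iterating this step until the likelihood stops improving is the training procedure; each iteration leaves the updated π, the rows of A and the rows of B properly normalized.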
Chapter 3

HMM for Financial Time Series Analysis

In this chapter we construct an HMM based on the discussions in chapter 2. We first give a brief survey of Weigend and Shi's previous work on the application of Hidden Markov Experts to financial time series prediction [SW97], analyze the problems with their approach, and then propose a new model. We use an HMM with emission density functions modeled as mixtures of Gaussians in combination with the usual transition probabilities: we introduce Gaussian mixtures as the continuous observation density function associated with each state, and derive the re-estimation formulas for these mixtures. We also apply an add-drop approach to the training data.

3.1 Previous Work

In [SW97], Weigend and Shi propose a Hidden Markov Expert Model to forecast the change of the S&P 500 Index at a half-hour time step. The "experts" at each state of the Hidden Markov Model are three neural networks with characteristics and prediction biases distinct from each other: one expert is good at forecasting low-volatility data, the second is best for high-volatility regions, and the third is responsible for the "outliers".¹

The training of the Hidden Markov Experts mainly involves the re-estimation of the transition probability matrix A; there is no re-estimation of the three "experts" themselves. This means the experts are fixed and must be trusted throughout the whole prediction process, because there is no way of improving their prediction accuracy. This can cause big problems during highly volatile periods or at the turning points of long-term trends, because even if we can divide a long time series into several segments with different volatilities, segments with similar volatilities may not necessarily exhibit similar patterns. Moreover, in this approach data at each time step are of equal importance, so too much weight is given to the early samples. We will solve this problem with an adjustable weighted EM algorithm in chapter 4.

Another problem with Shi and Weigend's model is the lack of additional information to support prediction decisions. Their model takes the target time series as its only input. By providing more of the available information, such as similar time series or indicator series derived from the original series, we can make the model more reliable. In section 3.3 we introduce multivariate Gaussian mixtures that take a vector composed of the Dow Jones Composite Index and the S&P 500 Index as input. Without having to worry about the exact mathematical relationship between these two series, we let the HMM combine them to make predictions for the S&P 500 Index. Then, in chapter 4, the Dow Jones Index is used as an error controller: when the system's prediction for the S&P 500 does not coincide with the Dow, we give less confidence to the HMM. This is reasonable because the Dow is a less volatile index than the S&P 500.

In the later sections of this chapter we introduce Gaussian mixtures in which all parameters of each mixture component — the counterpart of the "expert" — are adjustable, just as in equations (2.58) and (2.59). For the training data, a training pool is created. This pool keeps updating itself by adding the latest available data and eliminating the earliest data in the pool; the data in each pool are then trained together to generate the prediction for the next coming day. In this way we can balance between the long-term trend and short-term whipsaws.²

¹ Outliers are a few observations that are not well fitted by the "best" available model.
² A whipsaw occurs when a buy or sell signal is reversed in a short time. Volatile markets and sensitive indicators can cause whipsaws. To short means to sell something that you don't have; to do this, you have to borrow the asset that you want to short. Refer to appendix I for more explanation.
3.2 Baum-Welch Re-estimation for Gaussian Mixtures

The re-estimation formulas of the previous chapter mainly focused on the case where the symbol emission probabilities are discrete — in other words, where the emitted symbols are discrete symbols from a finite alphabet. Although, as mentioned before, we can use the vector quantization technique to map each continuous value into a discrete value via the codebook, there might be serious degradation after quantization. It is therefore advantageous to use HMMs with continuous observation densities. The most generally used form of the continuous probability density function (pdf) is the Gaussian mixture:

   bi(ot) = Σ_{m=1}^{M} wim N(ot, µim, Σim)    (3.1)

where ot is the observed vector and wim is the weight of the mth Gaussian mixture component at state i. The mixture weights w must satisfy the following constraints:

   Σ_{m=1}^{M} wim = 1,  1 ≤ i ≤ N
   wim ≥ 0,  1 ≤ i ≤ N,  1 ≤ m ≤ M    (3.2)

N(O, µim, Σim) is a multivariate Gaussian distribution with mean vector µ and covariance matrix Σ:

   N(O, µim, Σim) = (1 / √((2π)^n |Σ|)) exp(−(1/2)(o − µ)ᵀ Σ⁻¹ (o − µ))    (3.3)

The learning of the parameters of the HMM thus becomes the learning of the means and covariances of each Gaussian mixture component. As previously defined, M is the total number of Gaussian mixture components at state i and N is the total number of states. The intermediate variable δ is redefined as:

   δi(t) = Σ_{m=1}^{M} δim(t) = (1/P) Σ_{m=1}^{M} Σ_{j=1}^{N} αj(t − 1) aji wim bim(ot) βi(t)    (3.4)

where P = P(O|µ).
[Figure 3.1: Three single Gaussian density functions (G1: 20%, G2: 30%, G3: 50%). Each contributes a percentage to the combined output of the mixture.]
[Figure 3.2: A Gaussian mixture derived from the three Gaussian densities above. For each small area δ close to each point x on the x axis, there is a 20% probability that the random variable x is governed by the first Gaussian density function, a 30% probability that it is governed by the second Gaussian, and a 50% probability that it is governed by the third Gaussian.]
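The construction shown in the two figures can be sketched numerically. The 20%/30%/50% weights match the figures; the means and variances below are made-up values chosen only so the three bumps are visible on a 0-25 axis.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Single Gaussian density N(x; mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def mixture_pdf(x, weights, mus, sigmas):
    """Weighted sum of component densities, as in b_i(o_t) = sum_m w_im N(o_t; mu_im, Sigma_im)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

# Illustrative components in the spirit of Figures 3.1 and 3.2
weights = [0.2, 0.3, 0.5]
mus = [5.0, 10.0, 15.0]        # assumed means, not taken from the thesis
sigmas = [1.0, 1.5, 2.0]       # assumed standard deviations
```

Because the weights sum to one and each component integrates to one, the mixture is itself a valid density, which is why it can serve directly as the emission distribution bi(ot).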
The difference from the earlier setting is that, rather than associating each state with a single probability density function, each observation is now generated by a set of Gaussian mixture components, with each component contributing to the overall result by some weight. In other words, the observed data are generated by several distinct underlying causes that have some unknown relationship between them. Each cause contributes independently to the generation process, but we can only see the observation sequence as a final mixture, without knowing how much, and in what way, each cause contributes to the resulting observation sequence. We therefore also need a variable that denotes the probability of being at state j at time t with the kth mixture component accounting for the current observation ot:

   γj,k(t) = [αj(t) βj(t) / Σ_{j=1}^{N} αj(t) βj(t)] · [wjk N(ot, µjk, Σjk) / Σ_{k=1}^{M} wjk N(ot, µjk, Σjk)]    (3.5)

With the new intermediate variables defined, we are able to derive the re-estimation formulas below:

   µ̄im = Σ_{t=1}^{T} δim(t) ot / Σ_{t=1}^{T} δim(t)    (3.6)

   Σ̄im = Σ_{t=1}^{T} δim(t)(ot − µim)(ot − µim)ᵀ / Σ_{t=1}^{T} δim(t)    (3.7)

   w̄im = Σ_{t=1}^{T} δim(t) / Σ_{t=1}^{T} δi(t)    (3.8)

   āij = Σ_{t=1}^{T} αi(t) aij bj(ot+1) βj(t + 1) / Σ_{t=1}^{T} αi(t) βi(t)    (3.9)

3.3 Multivariate Gaussian Mixture as Observation Density Function

Sometimes we have observation vectors with more than one element. Many time series are correlated with other time series, and this phenomenon is of great interest in financial time series analysis. For example, in the stock market, volume by itself is of little interest, but it contains plenty of information about the trend of the market when combined with the close price. Canada's stock market is affected by the US economic cycle because Canada is the largest trading partner of the US. As such, the factors that affect
the US stock market also have an impact on the Canadian market. Researchers have found that the best Canadian year is the third year of the US presidential election cycle. Some day traders use a technique called "paired trading": they trade two stocks with similar characteristics at the same time, aiming to profit from arbitrage opportunities or to hedge possible risks. If they buy the stocks of Wal-Mart, they will short the stocks of Kmart. These examples show that there is a need to include relevant information beyond the single time series that we are studying.

3.3.1 Multivariate Gaussian distributions

The multivariate m-dimensional Gaussian has the same parameters as the single normal distribution, except that the mean is in the form of a vector µj and the covariance is in the form of an m × m covariance matrix Σj. The probability density function for the jth Gaussian is given by:

   nj(x, µj, Σj) = (1 / √((2π)^m |Σj|)) exp(−(1/2)(x − µj)ᵀ Σj⁻¹ (x − µj))    (3.10)

At each state, we assume the observation is generated by k Gaussians, so the overall output is:

   Σ_{j=1}^{k} wj N(x, µj, Σj)    (3.11)

The initial values of the weights wj for the mixture components should sum up to 1.

3.3.2 EM for multivariate Gaussian mixtures

We now develop the EM algorithm for re-estimating the parameters of the multivariate Gaussian mixture. Let θj = (µj, Σj, πj) be the jth mixture component; the parameters of the model are then Θ = (θ1, ..., θk)ᵀ. Also define a hidden variable E[zij] as the probability of generating the ith observation with the jth mixture component. The training process then becomes a process of finding the parameters that maximize the probability of this output observation. The EM algorithm that finds the parameters maximizing the probability of the observation sequence O consists of the following two steps (we use a mixture of two components θ1, θ2 to explain the algorithm):

Estimation step: Calculate the expected value E[zij] of each hidden variable zij, given that we know the current values of Θ = (θ1, θ2).

Maximization step: Calculate a new maximum-likelihood value of Θ = (θ1, θ2), assuming the value taken on by each hidden variable zij is the expected value computed in the first step.

In the HMM context, the probability of the hidden variable zij is the intermediate variable γij(t) discussed in section 3.2. Combining the EM algorithm for multivariate Gaussian mixtures with the EM algorithm for HMMs, we obtain the following formulas for each of the parameters of the HMM with multivariate Gaussian mixtures as observation density function:

   µ̄im = Σ_{t=1}^{T} δim(t) ot / Σ_{t=1}^{T} δim(t)    (3.12)

   Σ̄im = Σ_{t=1}^{T} δim(t)(ot − µim)(ot − µim)ᵀ / Σ_{t=1}^{T} δim(t)    (3.13)

   w̄im = Σ_{t=1}^{T} δim(t) / Σ_{t=1}^{T} δi(t)    (3.14)

   āij = Σ_{t=1}^{T} αi(t) aij bj(ot+1) βj(t + 1) / Σ_{t=1}^{T} αi(t) βi(t)    (3.15)

3.4 Multiple Observation Sequences

In the previous discussions, the EM algorithm was applied to a single observation sequence O. However, real-life financial time series data can span a very long range, and the pattern or trend at some point in time may be quite different from the overall pattern. Modeling the data as one single observation sequence might therefore give a low recognition rate when testing on data from some specific period of time. As stated by Graham [Gra86]: "There is no generally known method of chart reading that has been continuously successful for a long period of time. If it were known, it would be speedily adopted by numberless traders. This very following would bring its usefulness to an end."

Rabiner offers a way to solve a similar kind of problem in speech recognition by splitting
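The two steps can be sketched for the scalar (one-dimensional) case, which keeps the code short; the multivariate version replaces each σ by a covariance matrix and the component density by nj(x, µj, Σj). The data and starting values below are illustrative, not taken from the thesis.

```python
import math

def em_gmm(data, mus, sigmas, weights, iters=50):
    """EM for a Gaussian mixture, scalar case for clarity.
    mus/sigmas/weights are the per-component theta_j parameters."""
    for _ in range(iters):
        # E-step: E[z_ij] = posterior probability that sample i came from component j
        resp = []
        for x in data:
            ps = [w * math.exp(-(x - m) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
                  for w, m, s in zip(weights, mus, sigmas)]
            z = sum(ps)
            resp.append([p / z for p in ps])
        # M-step: re-estimate each theta_j as if the expected assignments were observed
        for j in range(len(mus)):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-6)   # floor keeps the density finite
            weights[j] = nj / len(data)
    return mus, sigmas, weights
```

Each pass through the loop is one Estimation step followed by one Maximization step; the likelihood of the data under the mixture never decreases from one pass to the next, which is the EM guarantee invoked later in chapter 4.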
the original observation sequence O into a set of K short sequences [Rab89]:

   O = [O(1), O(2), ..., O(K)]    (3.16)

where O(k) = [o1(k), o2(k), ..., oTk(k)] is the kth observation sequence and Tk is its length. Assuming that the observation sequences are mutually independent, the probability function P(O|µ) to maximize now becomes:

   P(O|µ) = Π_{k=1}^{K} P(O(k)|µ)    (3.17)

The variables αi(t), βi(t), pt(i, j) and δi(t) are calculated for each training sequence O(k), and the re-estimation formulas are modified by adding together the contributions of each sequence. In this way the new parameters can be viewed as averages of the optima given by the individual training sequences. The re-estimation formulas for multiple observation sequences are:

   µ̄im = [Σ_{k=1}^{K} (1/Pk) Σ_{t=1}^{Tk} δim(k)(t) ot(k)] / [Σ_{k=1}^{K} (1/Pk) Σ_{t=1}^{Tk} δim(k)(t)]    (3.18)

   Σ̄im = [Σ_{k=1}^{K} (1/Pk) Σ_{t=1}^{Tk} δim(k)(t)(ot(k) − µim)(ot(k) − µim)ᵀ] / [Σ_{k=1}^{K} (1/Pk) Σ_{t=1}^{Tk} δim(k)(t)]    (3.19)

   w̄im = [Σ_{k=1}^{K} (1/Pk) Σ_{t=1}^{Tk} δim(k)(t)] / [Σ_{k=1}^{K} (1/Pk) Σ_{t=1}^{Tk} δi(k)(t)]    (3.20)

   āij = [Σ_{k=1}^{K} (1/Pk) Σ_{t=1}^{Tk−1} αi(k)(t) aij bj(ot+1(k)) βj(k)(t + 1)] / [Σ_{k=1}^{K} (1/Pk) Σ_{t=1}^{Tk−1} αi(k)(t) βi(k)(t)]    (3.21)

Rabiner's algorithm is used to process a batch of speech sequences that are of equal importance, so he includes all the training sequences in the training process. In our financial time series, it is often the case that patterns change from time to time. In order to capture the most recent trend, we create a dynamic training pool: each training run starts from the parameters of the previous day's trading but uses today's data. The first data in the pool are popped out and new data are pushed in when a trading day ends.
[Figure 3.3: The dynamic training pool.]

3.5 Dynamic Training Pool

We introduce a new concept into the training process: the training pool. This is a segment of data that is updated dynamically throughout the training. At the end of a trading day, when the market closes, we can calculate the return for that day and add it to the pool; in the meantime, we drop the oldest data in the pool. The training is then performed again, based on the parameters from the last training and the data from the new pool. This way the system always trains on fresh data. In the training pool, each data sample serves as both a prediction target and a training sample. See figure 3.3 for a graphical illustration of the dynamic training pool.
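The pool's push-one/drop-one behavior amounts to a fixed-size sliding window. A minimal sketch follows; the class name and interface are our own, for illustration only.

```python
from collections import deque

class TrainingPool:
    """Sliding window of the most recent daily returns: when a trading
    day closes, its return is pushed in and the oldest sample drops out."""

    def __init__(self, size):
        self.size = size
        self.data = deque()

    def push(self, daily_return):
        self.data.append(daily_return)      # add the day that just ended
        if len(self.data) > self.size:
            self.data.popleft()             # drop the oldest data in the pool

    def samples(self):
        return list(self.data)              # current training set, oldest first
```

A pool size of 5 or 20 would correspond to the week-length and month-length windows discussed in section 3.6; after each push, re-estimation restarts from the previous day's parameters on the updated window.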
3.6 Making Predictions with HMM

3.6.1 Setup of HMM parameters

We construct an HMM with Gaussian mixtures as the observation density function (call it HMGM). The architecture of the HMGM is shown in Fig 3.4. The input to the HMGM is the real-number vector introduced in section 1.2; in the following section we will describe how the real-number output is transformed into long or short signals.

The HMM has 5 states, and each state is associated with a Gaussian mixture of 4 Gaussian probability distributions. Intuitively, the number of states is the number of strategies we have for making predictions. This number should be neither too large nor too small: too many strategies will make little distinction between one another and will increase the cost of computing, while if the number of states is too small, the risk of misclassification will increase. This can also be explained intuitively using financial theory: the more diversified the strategies, the less investment risk and the less return. The optimal number of states can be decided through experiments, and the number of mixture components follows the same reasoning. We have run experiments with different numbers of states and mixtures; based on our experience, 5 states and 4 mixture components are reasonable numbers for the model. This setting is similar to Shi and Weigend's experimental environment, where they have one neural net expert associated with each of their three states. See [WMS95] and [SW97].

The HMM is a fully connected trellis. Each transition probability is assigned a random real value, the initial state distribution probabilities Π = {πi} are assigned a set of random values, and the initial mixture weights w are assigned random values between 0 and 0.25. The mean of each Gaussian mixture component is assigned a real value between 0.3 and 0.4 — the normal range of possible returns for each day — and the initial value of the covariance of each Gaussian probability distribution function is set to 1.

3.6.2 Making predictions

We now come to the core part of the problem: making predictions before each trading day starts. The prediction process is actually carried out together with the training, at the same
[Figure 3.4: Hidden Markov Gaussian Mixtures — each of the five states s1, ..., s5 is associated with four mixture components G1, ..., G4, unrolled over time.]
time. The training set can be of any size, but there is a balance between the size of the training set and the sensitivity of the model: the bigger the training set, the more accurate the model in terms of finding the long-term trend, yet the trained model may then not be sensitive to whipsaws. As most financial analysts believe, a five-day or twenty-day set is desirable, because five days are the trading days in a week and one month consists of roughly twenty trading days.

At the end of each trading day, when the market return of that day is known to the public, we include it in the training system to absorb the most up-to-date information. This is exactly what human analysts and professional traders do; many of them work for hours after the market closes at 4pm Atlantic time. The HMGM is then trained again with the newest time series and its parameters are updated.

When the training is done and the parameters of the HMGM are updated, we perform the Viterbi algorithm to find the most "reliable" Gaussian mixture — the one associated with the most probable state. The induction step of section 2.7.2 is revised by replacing the function b_{ij ot} with the Gaussian mixture density for each state j over its mixture components m:

   δj(t) = max_{1≤i≤N} [δi(t − 1) aij] Σ_{m=1}^{M} wjm N(ot, µjm, Σjm)    (3.22)

   ψj(t) = argmax_{1≤i≤N} [δi(t − 1) aij] Σ_{m=1}^{M} wjm N(ot, µjm, Σjm)    (3.23)

Next we transform the real values generated by the Gaussian mixtures into binary symbols — positive or negative signals for the market. The weighted average of the means of the Gaussian mixture at the selected state is taken as the prediction for tomorrow, with a threshold at zero: if the weighted mean at the selected state is positive or zero, we output a positive signal, which in turn results in a long strategy; otherwise we output a negative signal. In other words, a weighted mean of exactly zero generates a positive signal. Algorithm 3.1 is the revised multi-observation EM algorithm for the HMGM, written in pseudo-code.
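The thresholding rule just described takes only a few lines. The function name is ours, for illustration; the zero threshold follows the text, where a weighted mean of exactly zero yields a positive (long) signal.

```python
def predict_signal(weights, means):
    """Long/short signal from the Gaussian mixture at the most probable
    state: weighted mean >= 0 -> long (+1), otherwise short (-1)."""
    weighted_mean = sum(w * m for w, m in zip(weights, means))
    return 1 if weighted_mean >= 0.0 else -1
```

With the 4-component mixtures of section 3.6.1, `weights` would be the four mixture weights of the Viterbi-selected state and `means` the corresponding component means, so the signal is a one-day-ahead direction call rather than a magnitude forecast.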
Algorithm 3.1: Combined training-prediction algorithm for the HMGM

while not End of Series do
  Initialization:
    if tradingDay = first day of series then
      assign random values to the HMM parameters
    else
      initialize parameters with the last updated values
    end if
  Training data set update:
    1. Drop day 1 from the previous training set
    2. Add the current day (the day that has just ended) to the training set
    3. Update all parameters using the EM algorithm
  Prediction:
    1. Run the Viterbi algorithm to find the Gaussian mixture associated with the most probable state
    2. Compare the weighted mean of the Gaussian mixture against the threshold zero
    3. Output the predicted signal
end while

3.7 Chapter Summary

In this chapter, we constructed an HMM to predict the financial time series — more specifically, the S&P 500 Index. Each state of the HMM is associated with a mixture of Gaussian functions with different parameters, which give each state a different prediction preference: some states are more sensitive to low-volatility data, while others are more accurate on high-volatility segments. The revised weighted EM update algorithm puts more emphasis on the recent data. The Dow Jones Index is included in the experiments as a supporting indicator for deciding the volatility of the S&P 500 series. Issues considered in the modeling process include the selection of the emission density function, the EM update rule, and the characteristics of the selected financial time series; these issues are addressed throughout the explanations in this chapter. We have achieved impressive results, which are shown in chapter 5.
Chapter 4

Exponentially Weighted EM Algorithm for HMM

In this chapter we first review the traditional EM algorithm in more depth, taking a close look at the composition of the likelihood function and the Q function. Originally formalized in [DLR77], the EM algorithm for re-estimation has been intensively studied under all sorts of problem settings, including the Hidden Markov Model [BP66] [RJ86] [Rab89], conditional/joint probability estimation [BF96] [JP98], constrained-topology HMMs [Row00], coupled HMMs [ZG02], hierarchical HMMs [FST98], and Gaussian mixtures [RW84] [XJ96] [RY98]. However, none of these variants of the EM algorithm takes the attenuation of importance along the time dimension into consideration. They take it for granted that each sample in the training data set is of equal importance, and the final estimation rests on this equal-importance assumption. This approach works well when the training set is small and the data series studied are not time-sensitive, or when the time dimension is not a major concern; the traditional EM-based HMM models have accordingly found great success in language and video processing, where the order of the observed data is not significantly important.

Financial time series, however, differ from all the above signals and have some unique characteristics. As we discussed in chapter 1, many patterns disappeared or diminished after they had been discovered for a while, no matter how big the training set is. We need to focus more on recent data while keeping the past patterns in mind, but at reduced confidence. A most persuasive illustration is the change of the Dow Jones Index: it took the Dow Jones Index 57 years to
2 The Traditional EM Algorithm The EM algorithm takes the target data as composed of two parts. .i. Realizing the change of importance over the time dimension and motivated by the concept of Exponential Moving Average (EMA) of ﬁnancial time series.1) The function L is called the likelihood function of the parameters given the data. EXPONENTIALLY WEIGHTED EM ALGORITHM FOR HMM 47 rise from 200 to 2000.1 Maximumlikelihood We have a density function p(XΘ) that represents the probability of the observation the set of means and covariances of the Gaussian mixtures associated with each state as well as the transition probability matrix and the probability of each mixture component. the probability of observing the whole data set is: N sequence X = {x1 . we further develop a double weighted EM algorithm that is sensitivityadjustable by including the volatility information. In the HMGM model Θ includes p(XΘ) = i=1 p(xi θ) = L(ΘX) (4.CHAPTER 4. yet it only took 12 years to rise from 2000 to 10. We show that the ﬁnal reestimation of the parameters of the HMM and the Gaussian densities can be expressed in a form of a combination of the exponential moving averages (EMAs) of the intermediate state variables.2) Since the log function is monotonous. 4.) with distribution p. . the observed part and the underlying part. Our goal of the EM algorithm is to ﬁnd the Θ∗ that maximizes L: Θ∗ = argmax L(ΘX) Θ (4. We show the convergence of the exponentially weighted EM algorithm is guaranteed by a weighted version of the EM Theorem. we presents the derivation of the exponentially weighted EM algorithm. Assuming the data are independent and identically distributed (i. . 4. xN } given the parameter Θ.d. To ﬁnd better balance between shortterm and longterm trends. 000. we often maximize the logarithm form of the likelihood function because it reduces multiplications to additions thus makes the derivation analytically easier. . 
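Equations (4.1) and (4.2) can be made concrete with a short sketch. The function names below (log_likelihood, gaussian_logpdf) are illustrative, not part of the thesis; under the i.i.d. assumption the log-likelihood is simply a sum of per-point log-densities, here shown for a single univariate Gaussian:

```python
import math

def log_likelihood(data, logpdf, theta):
    """log L(theta | X) = sum_i log p(x_i | theta), for i.i.d. data (eq. 4.1)."""
    return sum(logpdf(x, theta) for x in data)

def gaussian_logpdf(x, theta):
    """Log-density of a univariate Gaussian with theta = (mean, variance)."""
    mu, sigma2 = theta
    return -0.5 * (math.log(2 * math.pi * sigma2) + (x - mu) ** 2 / sigma2)
```

Maximizing this quantity over theta realizes the argmax of equation (4.2); working in log space turns the product of (4.1) into the sum above.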
The observed data are governed by some distribution of given form but unknown parameters; the underlying data can be viewed as a random variable itself. The task of the EM algorithm is to find the parameters of the distribution governing the observed part and of the distribution generating the underlying part.

Now assume X is the observed part, governed by some distribution with known form but unknown parameters, and let Y denote the hidden part, so that the complete data is Z = (X, Y); X is called the incomplete data. By the Bayes rule, the joint density of one item in the complete data sample is:

    p(z|Θ) = p(x, y|Θ) = p(y|x, Θ) p(x|Θ)                                        (4.3)

where x and y are individual items in the data samples X and Y. Note that the prior probability p(x|Θ) and the distribution of the hidden data are initially generated by assigning some random values. We now define the complete data likelihood function for the overall data set based on the above joint density of individual data:

    L(Θ|Z) = L(Θ|X, Y) = p(X, Y|Θ)                                               (4.4)

When estimating the hidden data Y, X and Θ can be taken as constants, and the complete data likelihood becomes a function of Y alone: L(Θ|X, Y) = h_{X,Θ}(Y), where Y is the only random variable.

Next, the EM algorithm finds the expected value of the complete data log-likelihood log p(X, Y|Θ) with respect to the unknown data Y, given the observed part X and the parameters from the last estimation. We define a Q function to denote this expectation:

    Q(Θ, Θ^(i-1)) = E[ log p(X, Y|Θ) | X, Θ^(i-1) ]
                  = ∫_{y∈Υ} log p(X, y|Θ) f(y|X, Θ^(i-1)) dy                     (4.5)

Here Θ^(i-1) are the estimates from the last run of the EM algorithm, and Υ is the space of possible values of y. Based on Θ^(i-1), we try to find the current estimates Θ that optimize the Q function. As mentioned above, we can think of Y as a random variable governed by some distribution f_Y(y), and of h(Θ, Y) as a function of two variables; the expectation can then be rewritten in the following form.
    Q(Θ) = E_Y[ h(Θ, Y) ] = ∫_y h(Θ, y) f_Y(y) dy                                (4.6)

In the M-step of the EM algorithm, we find the parameters Θ that maximize the Q function:

    Θ^(i) = argmax_Θ Q(Θ, Θ^(i-1))                                               (4.7)

The two steps are repeated until the probability of the overall observation no longer improves significantly. The EM algorithm is guaranteed to increase the log-likelihood at each iteration and will converge to a local maximum of the likelihood; the convergence is guaranteed by the EM Theorem [Jel97].

4.3 Rewriting the Q Function

An assumption that is not explicitly stated in the derivation of the EM algorithm in all the works cited above is that, in calculating the Q function, each data item is taken for granted to be of equal importance in the estimation of the new parameters. This is reasonable in problems like speech and visual signal recognition, because the order of the data in the observed sample is not that important; rather, the general picture consisting of all data points is of more interest. However, as stated at the beginning of this chapter, financial time series have characteristics that make the time dimension very important. As time goes on, old patterns attenuate and new patterns emerge; if the local data set that we use as the training set is very large, the newly emerged patterns or trends will inevitably be diluted. Therefore we need a new EM algorithm that gives more importance to more recent data. To achieve this goal, we must first rewrite the Q function.

Recall that the Q function is the expectation of the complete data likelihood function L with respect to the underlying data Y, given the observed data X and the last parameter estimates. We now include a time-dependent weight η in the expectation to reflect information on the time dimension. The revised expectation of the likelihood function, Q̂, is:

    Q̂(Θ, Θ^(i-1)) = E[ η log p(X, Y|Θ) | X, Θ^(i-1) ]                            (4.8)
    Q̂(Θ, Θ^(i-1)) = ∫_{y∈Υ} η log p(X, y|Θ) f(y|X, Θ^(i-1)) dy                  (4.9)

The weighted expectation can be understood intuitively: given the last parameter estimates and the observed data, the likelihood of each data item is discounted by a confidence density. For example, if the expected probability of observing one data item is 40%, we might have only 60% confidence in the accuracy of this result, meaning the final expected likelihood is only 0.4 × 0.6 = 24%. Since the logarithm function is monotone and 0 < η ≤ 1, we can take η out of the log, because η log P(X, y|Θ) = log P(X, y|Θ)^η, and therefore, for any two models Θ and Θg,

    [ η log P(X, y|Θ) ≤ η log P(X, y|Θg) ] ⟺ [ log P(X, y|Θ) ≤ log P(X, y|Θg) ] ⟺ [ P(X, y|Θ) ≤ P(X, y|Θg) ]

This brings great convenience to the E-step. In a later section we will show that the Q̂ function converges to a local maximum, using techniques from the EM Theorem.

The weight η is actually a vector of real values between 0 and 1, η = {η1, η2, . . . , ηN}, where N is the number of data items and ηj is the time-dependent confidence weight for the j-th observed data item, given its position in the whole data sequence.

4.4 Weighted EM Algorithm for Gaussian Mixtures

We now show how the parameters of the HMGM introduced in chapter 3 can be re-estimated with the weighted EM algorithm. Since the HMGM is in nature only a special case of Gaussian mixtures, we use a mixture of Gaussians as an example to illustrate the derivation process. To begin with, assume the following density for one single data item xj, given the parameters Θ of the Gaussian mixture:

    p(xj|Θ) = ηj ∑_{i=1}^{M} αi pi(xj|θi)                                        (4.10)

The parameters now consist of three types of variables, Θ = (ηj, α1, . . . , αM, θ1, . . . , θM). Here αi indicates which mixture component generates the observed data item and must observe the constraint ∑_{i=1}^{M} αi = 1, and pi is a density function for the i-th mixture component with parameters θi.
. θ1 . And p yi (yi xi .13) p(yX. Θg ) = i=1 p(yi xi . Note the subscripts of η are in the time dimension. . The subscripts for α and θ are in the spatial dimension. θM 1 M ˆ are the best possible parameters for the likelihood L(Θg X. pi component generates the observed data and it must observe the constraint is a density function for the ith mixture component with parameter θ i . . N g αgi pyi (xi θyi ) y p(xi Θg ) g αgi pyi (xi θyi ) y g g M k=1 αk pk (xi θk ) (4. for example y i = k indicates the ith data is generated by the k th mixture component. Θg ) (4. . We can obtain the complete data log likelihood function based on these hypothesized parameters: ˆ log L(ΘX. Θg ) is actually a conditional probability conditional on the data position i. . indicating the parameter of a certain mixture component. Θg ) = = and same as the basic EM algorithm. αg . We assume a set of initial parameter values. As mentioned above the conﬁdence weight η can be taken as a prior probability. . ηN .CHAPTER 4. Θg ) in the likelihood function can be computed using the Bayes’s rule: p(yi xi . Also assume θi as the parameters for the ith Gaussian function of the mixture. .12) = i=1 So we start from a set of guessed parameters and from the assumed parameters we can go g g on to compute the likelihood function. . . . Y ). there are N data in the training set. Y ) = η log P (X. (θ i includes the mean and covariance for the i th Gaussian). However this conditional probability is often ignored in traditional EM algorithm because it was taken for granted that the prior probability is always equal to 1. yN ) retain the independent property. .14) where y = (y1 . Last assume the conﬁdence weight ηn for the nth data. η i and αyi can be thought of as prior probabilities. . . The pyi (yi xi . EXPONENTIALLY WEIGHTED EM ALGORITHM FOR HMM M i=1 αi 51 = 1. Y Θ) N = i=1 N ηi log P (xi yi )P (y) ηi log αyi pyi (xi θyi ) (4. . αg . We guess that Θ g = η1 . . . .
With these initial guessed parameters Θg, we can try to find the current parameters Θ that maximize the Q̂ function:

    Q̂(Θ, Θg) = ∑_{y∈Υ} η log( L(Θ|X, y) ) p(y|X, Θg)
             = ∑_{y∈Υ} ∑_{i=1}^{N} ηi log( αyi pyi(xi|θyi) ) ∏_{j=1}^{N} p(yj|xj, Θg)
             = ∑_{l=1}^{M} ∑_{i=1}^{N} ηi log( αl pl(xi|θl) ) p(l|xi, Θg)        (4.15)

The last step collapses the sum over all assignments y = (y1, . . . , yN) by introducing the indicator δ_{l,yi} and summing out all positions j ≠ i:

    ∑_{y1=1}^{M} · · · ∑_{yN=1}^{M} δ_{l,yi} ∏_{j=1}^{N} p(yj|xj, Θg)
        = ( ∏_{j=1, j≠i}^{N} ∑_{yj=1}^{M} p(yj|xj, Θg) ) p(l|xi, Θg)
        = p(l|xi, Θg)                                                            (4.16)

This simplification can be carried out because of the constraint ∑_{yj=1}^{M} p(yj|xj, Θg) = 1: although the overall expectation is discounted by the time factor, the constraint on mixture components still holds at each data point along the time axis. The Q̂ function can then be written as:

    Q̂(Θ, Θg) = ∑_{l=1}^{M} ∑_{i=1}^{N} ηi log(αl) p(l|xi, Θg)
             + ∑_{l=1}^{M} ∑_{i=1}^{N} ηi log( pl(xi|θl) ) p(l|xi, Θg)           (4.17)
It becomes clear that to maximize the Q̂ function we can maximize the two terms of the above formula separately. The first term contains αl, which must observe the constraint ∑_l αl = 1, so this is a constrained optimization problem where a Lagrange multiplier λ can be applied. The maximum is achieved when the following equation is satisfied:

    ∂/∂αl [ ∑_{l=1}^{M} ∑_{i=1}^{N} ηi log(αl) p(l|xi, Θg) + λ ( ∑_l αl − 1 ) ] = 0   (4.18)

Taking the derivative gives:

    (1/αl) ∑_{i=1}^{N} ηi p(l|xi, Θg) + λ = 0                                    (4.19)

Multiplying by αl and summing both sides over l, and using ∑_{l=1}^{M} p(l|xi, Θg) = 1 together with ∑_l αl = 1, we get:

    ∑_{i=1}^{N} ηi + λ = 0                                                       (4.20)

so that we easily get

    λ = − ∑_{i=1}^{N} ηi                                                         (4.21)

Substituting λ into (4.19), we finally get the estimate of αl:

    αl = ∑_{i=1}^{N} ηi p(l|xi, Θg) / ∑_{i=1}^{N} ηi                             (4.22)
Next we derive the update rules for the mean and covariance of the Gaussian components. This is done by maximizing the second term of the Q̂ function. A d-dimensional Gaussian distribution has the form:

    pl(x|µl, Σl) = 1 / ( (2π)^{d/2} |Σl|^{1/2} ) exp( −(1/2) (x − µl)^T Σl^{−1} (x − µl) )   (4.23)

Substituting this distribution into the second term of (4.17), and taking away constant terms that will be eliminated after taking derivatives, we have:

    ∑_{l=1}^{M} ∑_{i=1}^{N} ηi [ −(1/2) log |Σl| − (1/2) (xi − µl)^T Σl^{−1} (xi − µl) ] p(l|xi, Θg)   (4.24)

The maximization of this expression with respect to µl is achieved by taking the derivative and setting it equal to 0:

    ∑_{i=1}^{N} ηi Σl^{−1} (xi − µl) p(l|xi, Θg) = 0                             (4.25)

Then we can work out the current estimate for µl:

    µl = ∑_{i=1}^{N} ηi xi p(l|xi, Θg) / ∑_{i=1}^{N} ηi p(l|xi, Θg)              (4.26)

Similarly, taking the derivative of (4.24) with respect to the covariance Σl and setting it equal to 0, we get:

    ∑_{i=1}^{N} ηi p(l|xi, Θg) [ Σl − (xi − µl)(xi − µl)^T ] = 0                 (4.27)

So the estimate for Σl is:

    Σl = ∑_{i=1}^{N} ηi p(l|xi, Θg) (xi − µl)(xi − µl)^T / ∑_{i=1}^{N} ηi p(l|xi, Θg)   (4.28)

4.5 HMGM Parameter Re-estimation with the Weighted EM Algorithm

Having shown the derivation of the exponentially weighted EM algorithm for the Gaussian mixture case, we now return to our HMGM model. As discussed in chapters 2 and 3, we have two intermediate variables describing the whole process of generating the observed data: γi(t), the probability of being in state i at time t, and wim, the probability of picking the m-th mixture component at state i. These intermediate variables combine to generate the probability of the observed data. We define the probability δim(t) that the m-th mixture density generates observation ot as:

    δim(t) = γi(t) wim bim(ot) / bi(ot)                                          (4.29)

We find that δim(t) enjoys the same status as the prior probability p(l|xi, Θg) in the previous discussion, and thus it can also be weighted. The calculation of the transition probability aij also involves a time variable.
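One full weighted EM step for a univariate Gaussian mixture can be sketched directly from equations (4.13), (4.22), (4.26) and (4.28). This is an illustrative sketch of the derivation above, restricted to one dimension; the function names are not from the thesis:

```python
import math

def gauss(x, m, v):
    """Univariate Gaussian density (the 1-D case of eq. 4.23)."""
    return math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)

def weighted_em_step(x, eta, alpha, mu, var):
    """One exponentially weighted EM step for a 1-D Gaussian mixture.

    x, eta        data points and their time-dependent confidence weights
    alpha, mu, var current mixture weights, means and variances
    Returns updated (alpha, mu, var) per eqs. (4.22), (4.26), (4.28).
    """
    M, N = len(alpha), len(x)
    # E-step: posteriors p(l | x_i, Theta^g) by Bayes' rule (eq. 4.13)
    post = []
    for i in range(N):
        dens = [alpha[l] * gauss(x[i], mu[l], var[l]) for l in range(M)]
        s = sum(dens)
        post.append([d / s for d in dens])
    # M-step: eta-weighted re-estimation
    new_alpha, new_mu, new_var = [], [], []
    for l in range(M):
        w = [eta[i] * post[i][l] for i in range(N)]
        sw = sum(w)
        new_alpha.append(sw / sum(eta))                               # (4.22)
        m = sum(w[i] * x[i] for i in range(N)) / sw                   # (4.26)
        new_mu.append(m)
        new_var.append(sum(w[i] * (x[i] - m) ** 2
                           for i in range(N)) / sw)                   # (4.28)
    return new_alpha, new_mu, new_var
```

With all eta equal to 1 this reduces to the classical unweighted EM update, which is a quick way to check the derivation.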
So the exponentially weighted update rules for the HMGM are derived in a similar form, taking the time-attenuating weight into consideration:

    µim = ∑_{t=1}^{T} ηt δim(t) ot / ∑_{t=1}^{T} ηt δim(t)                       (4.30)

    σim = ∑_{t=1}^{T} ηt δim(t) (ot − µim)(ot − µim)^T / ∑_{t=1}^{T} ηt δim(t)   (4.31)

    wim = ∑_{t=1}^{T} ηt δim(t) / ∑_{t=1}^{T} ηt δi(t)                           (4.32)

    aij = ∑_{t=1}^{T} ηt αi(t) aij bj(ot+1) βj(t + 1) / ∑_{t=1}^{T} ηt αi(t) βi(t)   (4.33)

4.6 Calculation and Properties of the Exponential Moving Average

Before we go on to show how we can rewrite the re-estimation formulas with EMAs of the intermediate variables, let us take a closer look at the calculation and properties of the EMA. The EMA was originally invented as a technical analysis indicator that keeps track of trend changes in financial time series. As its name implies, the EMA changes as time moves on. Short-term EMAs respond faster than long-term EMAs to changes in the trend, and when the curves of two EMAs of different lengths cross each other, this usually signals a trend change. Technical analysts have developed other indicators based on the EMA, such as the Moving Average Convergence Divergence (MACD). Figure 4.1 shows the signals indicated by comparing the 30-day EMA and the 100-day EMA for an individual stock, Intel.

Take the variable pt(ij) for illustration. The calculation of a five-day exponential moving average for this intermediate variable consists of two steps. First, calculate the arithmetic average of the first five days; denote this average p̄5(ij). Second, starting from the sixth day, the EMA of each day is:

    p̄t(ij) = ρ × pt(ij) + (1 − ρ) × p̄t−1(ij)                                    (4.34)
Figure 4.1: When the 30-day moving average moves above the 100-day moving average, the trend is considered bullish; when the 30-day moving average declines below the 100-day moving average, the trend is considered bearish. (Picture from stockcharts.com.)
Note that in equation (4.34), pt(ij) is the actual intermediate variable value at time t, while p̄t(ij) is the exponential moving average of that variable at time t. ρ is the weight we have been talking about all along; its value is given by the formula ρ = 2/(1 + k), where k is the number of days in the moving window. For example, if the size of the moving window is 5, which means we are calculating a five-day EMA, then the value of ρ is:

    ρ = 2/(1 + 5) = 0.3333                                                       (4.35)

So the last day accounts for one third of the final value: the last day in the training window takes up quite a large portion and notably affects the training procedure. We keep running the second step until the last day of the training pool. The five-day EMA at the 20th day of the training pool is calculated by running equation (4.34) up to the 20th day:

    p̄6(ij)  = p6(ij) × ρ + p̄5(ij) × (1 − ρ)
    p̄7(ij)  = p7(ij) × ρ + p̄6(ij) × (1 − ρ)
    . . .
    p̄20(ij) = p20(ij) × ρ + p̄19(ij) × (1 − ρ)                                   (4.36)

As we can see, the importance of the early data drops exponentially as time goes on. This is easily shown by expanding the recursion (4.34):

    p̄t(ij) = ρ × pt(ij) + (1 − ρ) × p̄t−1(ij)
           = ρ × pt(ij) + (1 − ρ) × [ ρ × pt−1(ij) + (1 − ρ) × p̄t−2(ij) ]
           = ρ × pt(ij) + ρ(1 − ρ) × pt−1(ij) + (1 − ρ)^2 × p̄t−2(ij)            (4.37)

The EMA of the second intermediate variable δi(t), which stores the probability of being at state i at time t, can be calculated in a similar way:

    δ̄i(t) = ρ × δi(t) + (1 − ρ) × δ̄i(t − 1)
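The two-step EMA calculation described above (arithmetic average to seed the first k days, then the recursion of equation (4.34)) can be written as a small sketch; the function name is illustrative:

```python
def ema(values, k):
    """k-day exponential moving average with rho = 2/(1+k).

    Step 1 seeds the average with the arithmetic mean of the first k values;
    step 2 applies the recursion avg_t = rho*v_t + (1-rho)*avg_{t-1} (eq. 4.34).
    Returns the list of EMA values from day k onward.
    """
    rho = 2.0 / (1 + k)
    avg = sum(values[:k]) / k       # step 1: arithmetic average of first k days
    out = [avg]
    for v in values[k:]:            # step 2: recursive update from day k+1 on
        avg = rho * v + (1 - rho) * avg
        out.append(avg)
    return out
```

With k = 5 we get rho = 1/3, so each new day contributes one third of the updated average, exactly the "last day accounts for one third" observation above.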
In calculating the EMA at day t, the weight of the EMA at day t − 1 takes up a portion of (1 − ρ), while the weight of the EMA at day t − 2 is only (1 − ρ)^2; by inference, day t − 3 has a portion of (1 − ρ)^3 in the EMA at day t. As shown in Figure 4.2, the EMA at the 19th day accounts for 2/3 of the EMA at day 20, and Figure 4.3 gives a graphical view of the portion each day's actual data takes in the calculation of the EMA. A nice property of the exponentially weighted intermediate parameters is that, when calculating this weighted average, the information from the early data is not lost: the old data are also included in the final average, while the larger part of the weight is given to the recent data.

Figure 4.2: The portion of each of the first 19 days' EMA in calculating the EMA for the 20th day.

4.7 HMGM Parameter Re-estimation in a Form of EMA

In the previous sections we derived the exponentially weighted estimation rules for the parameters of the HMGM, and showed the formulas with the exponential discount weight η. However, there is still one problem regarding η: how is the time-dependent variable η defined? In fact, η can be taken as a density function whose distribution is prior knowledge. Our task now is to find an expression for η in terms of time.
Figure 4.3: The portion of each of the 20 days' actual data in calculating the EMA for the 20th day. The last day accounts for 1/3 of the EMA at day 20. The data earlier than day 10 take up a very small portion, but they are still included in the overall calculation.
Giving more weight to recent data makes the HMM sensitive to trend changes in the time series: the HMM will ride on a new trend quickly, without having to make too many mistakes at the beginning of the new trend before it gets fit to it. The older the data, the less important they become and the smaller the portion they take in the average, but their impact never disappears entirely. That is one important motivation for replacing the arithmetic average with the EMA in the EM update rule.

There is a second motivation. The EMA was invented as a technical analysis tool that deals with financial time series data. In the EM algorithm for HMM parameter re-estimation, there is a set of intermediate variables: αi(t), βi(t), δi(t), δim(t), and so on. These intermediate variables all have actual meanings: they indicate the probabilities of certain events at a certain point in time. So these model variables are directly related to the time series data, such as prices and volume. This is another reason we rewrite the re-estimation formulas in the form of EMAs of the model variables.

Take the denominator of the µ re-estimation as an example. The sum can be expanded as:

    ∑_{t=1}^{T} ηt δim(t) = η1 δim(1) + η2 δim(2) + · · · + ηT−1 δim(T − 1) + ηT δim(T)   (4.38)

Each δim(t) at a time point t is associated with one value of ηt. Let ρ = 2/(1 + k), where k is the multiplier of a k-day EMA, and let δ̄im(t) represent the k-day EMA of δim at time t. Then set ηt by the following rules:

    1. If 1 ≤ t ≤ k, then ηt = (1 − ρ)^(T−k) · (1/k).
    2. If k < t < T, then ηt = ρ · (1 − ρ)^(T−t).
    3. If t = T, then ηt = ρ.
Substituting these rules into expression (4.38), the first k terms work together to generate the EMA for the first k periods, and from k + 1 to T each term contributes its exponentially discounted share:

    ∑_{t=1}^{T} ηt δim(t)
        = (1 − ρ)^(T−k) · (1/k) · δim(1) + · · · + (1 − ρ)^(T−k) · (1/k) · δim(k)
        + ρ · (1 − ρ)^(T−(k+1)) · δim(k + 1) + ρ · (1 − ρ)^(T−(k+2)) · δim(k + 2) + · · ·
        + ρ · (1 − ρ) · δim(T − 1) + ρ · δim(T)                                  (4.39)

Applying the recursion (4.34) to δim and using induction on equation (4.39), the terms telescope; for the last step we get:

    (1 − ρ) · [ ρ · δim(T − 1) + (1 − ρ) · δ̄im(T − 2) ] + ρ · δim(T)
        = (1 − ρ) · δ̄im(T − 1) + ρ · δim(T) = δ̄im(T)                            (4.40)

Therefore the re-estimation formulas finally come into a form of EMAs of some intermediate variables:

    µim = ∑_{t=1}^{T} ηt δim(t) ot / δ̄im(T)                                     (4.41)

    σim = ∑_{t=1}^{T} ηt δim(t) (ot − µim)(ot − µim)^T / δ̄im(T)                 (4.42)

    wim = δ̄im(T) / δ̄i(T)                                                       (4.43)

    aij = ∑_{t=1}^{T} ηt αi(t) aij bj(ot+1) βj(t + 1) / ( ᾱi(T) β̄i(T) )         (4.44)

4.8 The EM Theorem and Convergence Property

This section shows that our exponentially weighted EM algorithm is guaranteed to converge to a local maximum, by a revised version of the EM Theorem. For a review of the EM Theorem, please refer to [Jel97].
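The three rules for ηt can be checked numerically: the weighted sum of equation (4.38) under this schedule reproduces exactly the two-step EMA recursion of equation (4.34). A sketch, assuming 1-indexed time as in the text (the function name is illustrative):

```python
def eta_weights(T, k):
    """Weight schedule eta_t for t = 1..T with rho = 2/(1+k):
    eta_t = (1-rho)^(T-k) / k     for 1 <= t <= k
    eta_t = rho * (1-rho)^(T-t)   for k < t < T
    eta_t = rho                   for t = T
    """
    rho = 2.0 / (1 + k)
    w = []
    for t in range(1, T + 1):
        if t <= k:
            w.append((1 - rho) ** (T - k) / k)
        elif t < T:
            w.append(rho * (1 - rho) ** (T - t))
        else:
            w.append(rho)
    return w
```

A useful side observation: these weights sum to 1, since the geometric tail ρ ∑_{j=0}^{T−k−1} (1 − ρ)^j plus the seed weight (1 − ρ)^(T−k) telescopes to one, so the schedule is a proper discrete probability over the training window.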
The EM Theorem for the exponentially weighted EM algorithm: Let X and Y be random variables with distributions P(X) and P(Y) under some model Θ, and let P̂(X) and P̂(Y) be the distributions under a model Θ̂ obtained after running the exponentially weighted EM algorithm. Then

    ∑_Y P(Y|X, η) log [ P̂(X, Y|η) / P̂(Y|X, η) ] ≥ ∑_Y P(Y|X, η) log [ P(X, Y|η) / P(Y|X, η) ]
        ⟹ P̂(X|η) ≥ P(X|η)                                                      (4.45)

This theorem implies that if the left-hand inequality holds for a model Θ and an updated model Θ̂, then the observed data X will be more probable under Θ̂ than under Θ.

Proof: From the derivation of the exponentially weighted update rule we know

    0 ≤ ∑_Y P̂(Y|X, η) ≤ 1                                                      (4.46)

and the logarithm function is monotone. So:

    log P̂(X|η) ≥ log P(X|η)
    ⟺ ∑_Y P(Y|X, η) log P̂(X|η) ≥ ∑_Y P(Y|X, η) log P(X|η)
    ⟺ ∑_Y P(Y|X, η) log [ P̂(Y, X|η) / P̂(Y|X, η) ]
       ≥ ∑_Y P(Y|X, η) log [ P(Y, X|η) / P(Y|X, η) ]
       − ∑_Y P(Y|X, η) log [ P̂(Y|X, η) / P(Y|X, η) ]

and the last subtracted term satisfies

    − ∑_Y P(Y|X, η) log [ P̂(Y|X, η) / P(Y|X, η) ] ≥ 0                           (4.47)

The last step follows from Jensen's inequality, so the hypothesis of (4.45) implies P̂(X|η) ≥ P(X|η). Q.E.D.

Jensen's inequality basically states that misestimating the distribution governing a random variable will eventually result in increased uncertainty/information. For a thorough review of Jensen's inequality, please refer to [HLP88] and [GI00].
Figure 4.4: The prediction falls behind the market during highly volatile periods of time. This is because the system is too sensitive to recent changes.

4.9 The Double Weighted EM Algorithm

There is still room for improvement in the previous exponentially weighted EM rule. However, just as with any other improvement, we have to consider the cost of the improved sensitivity and find a way to overcome any possible problem it introduces.

4.9.1 Problem with the Exponentially Weighted Algorithm

The exponentially weighted EM algorithm quickly picks up the trend at times when the trend turns. However, sometimes, and especially during highly volatile periods, the direction of the financial time series reverses at a very high frequency within a very short period of time. Figure 4.5 shows a problem caused by this high sensitivity: not only are the turning points of the market all missed, but there is also the danger of falling right behind the market movement, because the predicted turning points are not the actual market turning points. The system is so sensitive to recent data that it falls into the pitfall of high volatility. Actually, if one keeps predicting in one same direction during such a period, or refers to the long-term trend, one may have a better opportunity of achieving a higher gain.

4.9.2 Double Exponentially Weighted Algorithm

To solve this problem, we arm the HMM with a self-confidence adjustment capability: we multiply the weight for the most recent data, η, by a second weight ζ. The weight ζ is the confidence rate for the weight η. It is calculated by comparing the trend signal (the positive and negative symbols) of the S&P 500 Index with the trend signal of the Dow Jones Index.
Specifically, we compare the S&P 500 with the Dow for the past five days, and the confidence weight ζ is calculated as the number of days on which the Dow coincides with the S&P, divided by 5. If they have the same direction on every one of the past 5 days, the HMM gets full confidence 1; otherwise ζ < 1, and we multiply the two weights together, so that η × ζ is the weight used for further calculation. The Dow Jones is a relatively mild index compared to the S&P 500 and NASDAQ, so it preserves the long-term trend well. We could of course use other indicators that preserve the long-term trend, but for the time being we use the Dow in our experiments.

The adjustable weighted exponential averages for the two intermediate variables are:

    p̄t(ij) = η × ζ × pt(ij) + (1 − η × ζ) × p̄t−1(ij)
    δ̄i(t)  = η × ζ × δi(t) + (1 − η × ζ) × δ̄i(t − 1)                            (4.48)

We then apply the formulas (4.41)-(4.44) to get the updated parameters.

4.10 Chapter Summary

In this chapter we derived a novel exponentially weighted EM (EWEM) algorithm for training HMGMs. The EWEM algorithm takes the time factor into consideration by discounting the likelihood function of the traditional EM by an exponentially attenuating weight, which we call the confidence weight. This is very important in financial time series analysis and prediction, because financial time series are known to be non-stationary, and different periods should be of different interest in generating a prediction for the future. We showed that the re-estimation of the parameters comes in a form of exponential moving averages of the intermediate variables used for calculating the final parameters, and we proved that convergence to a local maximum is guaranteed by an exponential version of the EM Theorem. To find a better balance between long-term and short-term trends, we further developed a double weighted EM algorithm that is able to adjust its training sensitivity automatically.
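The confidence weight ζ and the double-weighted update of equation (4.48) can be sketched as follows; the function names and the sign convention for flat days (a zero return counted as non-negative) are illustrative assumptions:

```python
def confidence_weight(sp_returns, dow_returns, days=5):
    """zeta: fraction of the last `days` trading days on which the sign of the
    S&P 500 daily return agrees with the sign of the Dow's daily return."""
    agree = sum(1 for s, d in zip(sp_returns[-days:], dow_returns[-days:])
                if (s >= 0) == (d >= 0))
    return agree / days

def double_weighted_ema(p, p_prev, eta, zeta):
    """One step of eq. (4.48): p_bar_t = eta*zeta*p_t + (1 - eta*zeta)*p_bar_{t-1}."""
    w = eta * zeta
    return w * p + (1 - w) * p_prev
```

When the two indices disagree often (high ζ-discount), the effective weight η × ζ shrinks and the update leans more on the previous average, which is exactly the self-confidence adjustment described above.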
Chapter 5

Experimental Results

This chapter contains all the experimental results. The first section shows the prediction results on a synthetic data set. The second section includes the predictions for the years 1998 to 2000 of the NASDAQ Composite Index with a recurrent neural network. The third section includes the prediction results of the HMM with Gaussian mixtures as emission density functions, trained with the simple weighted EM algorithm. The last section shows the results from the weighted EM algorithm.

5.1 Experiments with a Synthetic Data Set

We first test the different models on an artificially generated data set. The data are generated by a function composed of three parts:

    f(t) = 0.7 × f(t − 1) + 0.2 × (τ(t)/100 − 0.005) + 0.1 × 0.00001             (5.1)

Here 70% of the data at step t comes from that of step t − 1; this is to try to capture the impact of the recent past. τ(t) is a random number, so 20% of the output is generated randomly. Lastly, there is a 10% up-growing trend at the growth rate of 0.00001. This may not exactly be the case in real-life financial time series, but our model should at least be able to handle it before being applied to real data.

We test the recurrent neural network and the HMGM on this artificial data set for 400 days; both models predict the signal 5 days ahead. We also test the data with different training windows.
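A generator for the synthetic series of equation (5.1) can be sketched as below. The thesis does not state the distribution of τ(t); here we assume τ(t) ~ Uniform(0, 1), so the random term lies in roughly (−0.001, +0.001), and the seed and function name are our own choices:

```python
import random

def synthetic_series(n=400, seed=0):
    """Generate n steps of eq. (5.1):
    f(t) = 0.7*f(t-1) + 0.2*(tau(t)/100 - 0.005) + 0.1*0.00001,
    assuming tau(t) ~ Uniform(0, 1) (an assumption; the thesis only says
    tau(t) is a random number)."""
    rng = random.Random(seed)
    f = [0.0]
    for _ in range(n):
        tau = rng.random()
        f.append(0.7 * f[-1] + 0.2 * (tau / 100 - 0.005) + 0.1 * 0.00001)
    return f[1:]
```

Under this assumption the series stays small and mean-reverting with a faint positive drift, matching the 70%/20%/10% decomposition described in the text.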
Results show that over the test period of 395 days (prediction starts on the 6th day), the HMGM performs better than the recurrent neural network by 10.55%, and ends 22.2% above the cumulative return based on the artificially generated data. The HMGM underperforms a follow-the-trend strategy that takes yesterday's signal as today's prediction; this is because the data are composed in such a way that a large portion of each data point is taken directly from the recent past. Still, it is clearly shown that the HMGM is the better of the two statistical learning models, and that it works better than the buy-and-hold strategy (which is the index itself). In the following sections we will test the HMGM on real-life data, the S&P 500 Index.

The following table shows the comparison of accuracy between the recurrent network and the exponentially weighted HMGM (EW-HMGM). The accuracy rates are based on statistics from ten runs of both models on ten data sets generated by equation (5.1). As we can see, the EW-HMGM outperforms the recurrent network on positive signals and is slightly behind on negative ones, but has a better overall performance. The EW-HMGM also has less volatility in prediction accuracy across different runs of the experiment.

Figure 5.1: Experimental results on computer-generated data.
Table 5.1: Comparison of the recurrent network and the exponentially weighted HMGM on the synthetic data testing. The table reports the minimum, maximum, and average prediction accuracy (Min/Max/Avg) of each model for positive signals, negative signals, and overall, over the ten runs.

Figure 5.2: Prediction of the NASDAQ Index in 1998 with a recurrent neural network.

5.2 Recurrent Neural Network Results

The NASDAQ Index data are separated into three years. As we can see from the figures, recurrent neural networks work better during low-volatility, steady trends, no matter whether it is a bull market or a bear market. By that we mean the recurrent network performs better during a long-term bull market (as in late 1998 and early 1999) or a long-term bear market (as in the year 2000). As can be seen from the data, the recurrent neural network has a positive bias toward steady trends.
Figure 5.3: Prediction of the NASDAQ Index in 1999 with a recurrent neural network.

Figure 5.4: Prediction of the NASDAQ Index in 2000 with a recurrent neural network.
This is especially apparent in the prediction for the year 2000. Part of the reason is the memory brought by the recurrent topology: the past data has too much impact on the current training and prediction.

5.3 HMGM Experimental Results

We launch a series of experiments on the Hidden Markov Gaussian Mixtures (HMGM). We test all the approaches mentioned in chapter 4 with 400 days of the S&P 500 Index starting from Jan 2, 1998. The EM algorithm is performed on the training data with different training windows; the training window introduced in chapter 3 is adjustable from 10 days to 100 days. Results show that the prediction performance changes as the training window size is adjusted. The best result is achieved by the 40-day training window, and prediction becomes worse as the training window is changed in either direction.

First take a look at the HMGM with Rabiner's arithmetic average update rule. From t213 to t228, the actual market had an apparent reverse in direction, from a decline to an

[Figure 5.5: Prediction of the 400-day S&P 500 Index starting from Jan 2, 1998. Training window size = 10.]
[Figure 5.6: Prediction of the 400-day S&P 500 Index starting from Jan 2, 1998. Training window size = 20.]
increase. The simple HMGM was not able to recognize this in a short time, which costs a loss in profit. But the overall cumulative profit at the end of the testing period is still 20% above the market return.

The multivariate HMGM takes a vector of {S&P daily return, Dow Jones daily return} as input. We do not know the exact relationship between these two financial time series, but the multivariate HMM uses the EM algorithm to find the best parameters that maximize the probability of seeing both series. The multivariate HMGM generates a cumulative return similar to the single-series HMGM but has less volatility in the later part of the testing data.

The exponentially weighted EM algorithm performs significantly better than the first two models, mainly because it picks up a reverse trend much faster. During the periods t164 to t178 and t213 to t228, the trend reversals were recognized and we gain points there, so that the later cumulative gains can be based on a higher early gain. However, it also loses points during periods where the volatility is higher, such as the period after t228. At the end of the whole period, a bull market started and the model responded quickly to this
[Figure 5.7: Prediction of the 400-day S&P 500 Index starting from Jan 2, 1998. Training window size = 40.]
signal and made no mistake during the long bull. We made nearly twice as much as the market at the end of the 400 days.

Then we tested the double weighted EM algorithm. There is not much difference from the single exponentially weighted EM algorithm, partly because the S&P 500 and the Dow give identical direction signals most of the time during the testing period. But a small difference at the end of the 400 days gives the double weight algorithm a 2% better performance.

Finally we produce an all-in-one picture with all the prediction results from the different models for comparison. As seen from the picture, the two weighted HMGMs perform convincingly better than the traditional EM algorithm.
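The rolling training-window procedure used throughout these experiments, refit on the last `window` days and then predict the next day, can be sketched as below. The `fit_mean`/`predict_sign` stand-ins are placeholders for illustration only, not the actual HMGM/EM estimator: here the "model" is just the window mean and the signal is its sign.

```python
def rolling_window_signals(returns, window, fit, predict):
    """For each day t >= window, refit the model on the previous `window`
    observations, then emit a +1/-1 signal for day t."""
    signals = []
    for t in range(window, len(returns)):
        model = fit(returns[t - window:t])
        signals.append(predict(model))
    return signals

# Toy stand-ins for the real estimator (assumed, for illustration only).
fit_mean = lambda xs: sum(xs) / len(xs)
predict_sign = lambda m: 1 if m >= 0 else -1

sigs = rolling_window_signals([0.02, 0.01, -0.03, -0.02, 0.04], 2,
                              fit_mean, predict_sign)
```

Because training and testing are combined in one pass, the window size trades off short-term sensitivity against long-term memory, which is exactly why an intermediate (40-day) window performs best.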
5.4
Test on Data Sets from Diﬀerent Time Periods
To show that the results generated by the HMGM are not random and can repeatedly beat the buy-and-hold strategy, we test the model on data sets from different time periods. The data tested are four segments of the S&P 500 Index: November 2, 1994 to June 3, 1996,
[Figure 5.8: Prediction of the 400-day S&P 500 Index starting from Jan 2, 1998. Training window size = 100.]
June 4, 1996 to December 31, 1998, August 5, 1999 to March 7, 2001, and March 8, 2001 to October 11, 2002 (the period January 2, 1998 to August 4, 1999 has been tested before). The first two segments are in the bull market cycle while the latter two segments are periods of a bear market. Experiments show that the HMGM performs equally well under both situations.
5.5
Comparing the Sharpe Ratio with the Top Five S&P 500 Mutual Funds
The Sharpe Ratio, also called the reward-to-variability ratio, is a measure of the risk-adjusted return of an investment. It was derived by W. Sharpe [Sha66]. Sharpe argues that the mean and variance by themselves are not sufficient to measure the performance of a mutual fund; therefore he combined the two variables into one and developed the Sharpe Ratio. The definition of the Sharpe Ratio is quite straightforward. Let $R_{Ft}$ be the return on the fund in period $t$, $R_{Bt}$ the return on the benchmark portfolio or security in period $t$, and
Note that from t213 to t228 the HMM was not able to catch up with the change in trend swiftly.

[Figure 5.9: Prediction with HMGM for 400 days of the S&P 500 Index.]

[Figure 5.10: Prediction with multivariate HMGM for 400 days of the S&P 500 Index.]

[Figure 5.11: Prediction with single exponentially weighted HMGM.]

[Figure 5.12: Prediction with double weighted HMGM that has a confidence adjustment variable.]

[Figure 5.13: Results from all models (double weight, single weight, HMGM, multivariate Gaussian, market) for comparison.]

[Figure 5.14: Results on the period from November 2, 1994 to June 3, 1996.]

[Figure 5.15: Results on the period from June 4, 1996 to December 31, 1998.]

[Figure 5.16: Results on the period from August 5, 1999 to March 7, 2001.]
[Figure 5.17: Results on the period from March 8, 2001 to October 11, 2002.]

$D_t$ the differential return in period $t$:
$$D_t = R_{Ft} - R_{Bt} \quad (5.2)$$
Let $\bar{D}$ be the average value of $D_t$ over the historic period from $t = 1$ through $T$:
$$\bar{D} = \frac{1}{T}\sum_{t=1}^{T} D_t \quad (5.3)$$
and let $\sigma_D$ be the standard deviation over the period:
$$\sigma_D = \sqrt{\frac{\sum_{t=1}^{T}(D_t - \bar{D})^2}{T}} \quad (5.4)$$
The Sharpe Ratio ($S_h$) is:
$$S_h = \frac{\bar{D}}{\sigma_D} \quad (5.5)$$
As we can see, the ratio indicates the average differential return per unit of historic variability of the differential return. Using the Sharpe Ratio, we can compare the return of
different strategies at the same risk. We now compare the top 5 S&P 500 mutual funds that performed best over the past three years with the HMGM model we proposed. The HMGM outperforms all of them. In both testing periods, four of the five funds have negative Sharpe Ratios, and the absolute value of the positive one is much smaller than that of the HMGM prediction. The double weighted HMGM thus consistently outperforms the top five S&P 500 mutual funds. This is the meaning of "beat the market" mentioned in chapter 1.

The Sharpe Ratio can also be used for significance testing in financial time series analysis and prediction, because the t-statistic of the mean differential return equals the Sharpe Ratio times the square root of the total number of testing units; the Sharpe Ratio is thus proportional to the t-statistic of the means.

[Table 5.2: Comparison of the Sharpe Ratio between the HMGM and the top 5 S&P 500 funds (BTIIX, FSMKX, SPFIX, SWPPX, VINIX) during the period of August 5, 1999 to March 7, 2001.]

[Table 5.3: Comparison of the Sharpe Ratios of the same funds and the HMGM during the period of March 8, 2001 to October 11, 2002.]
5.6 Comparisons with Previous Work

The Hidden Markov Model, as a sequence learning model, has been applied to many fields including natural language processing, bioinformatics, and audio and video signal processing (see 2.4). But there have been few applications of HMMs to financial time series analysis and prediction. One reason is that it is usually hard to find functions associated with the states that characterize the behavior of the time series. For example, in [MCG04], a binomial density function is used at each state to represent the risk of default. A second reason lies in the EM algorithm: the state parameters in the re-estimation formulas all have actual meanings, representing the probabilities of a variety of situations happening in the HMM.

Financial time series differ from other sequence data in that they are time dependent. In video/audio/language/bioinformatics signal processing, this dependence is not important, and the data must be treated as of equal importance. But patterns in financial time series tend to change over time, as we have shown in the super five day pattern and the January Effect cases. In the two weighted EM algorithms we propose in this thesis, adding the time-dependent weights incorporates temporal information into the training process of the HMM. Further, the double weighted EM algorithm enables self-adjustment of the weights.

Bengio [BF96] introduced an Input-Output Hidden Markov Model (IOHMM), which has been applied to electroencephalographic signal rhythms [CB04]. Their model enables the HMM to generate the output probability based on the current state as well as the current input data sample from the input sequence. However, the input sequence is just another sequence of data; in the financial setting, the input is a time-dependent value that changes from time to time.

We now compare our experimental results with Shi and Weigend's work [SW97]. They used three neural networks (called experts) as the emission functions at each state, so there are three states in all. Each expert has its own expertise: the first one is responsible for low volatility regions, the second one is good at high volatility regions, and the third expert acts as a collector for the outliers. These techniques enrich the representation power of the HMM in dealing with financial time series data, but they still do not realize the importance of time. They test on 8 years of daily S&P 500 data from 1987 to 1994; their return over the 8-year period is 120%. This is empirical research. In our experiments, each of the five 400-day testing periods generates
a return close to 100%. However, the main problem of their prediction is not the rate of return. As we can see from their profit curve (Figure 5.18), the profit curve has a shape similar to the index curve. This means that too many positive signals are generated by the prediction system; in other words, much of the time it is taking a long position, so the profit of the prediction follows the index behavior to a large extent. In all our five 400-day periods, the profit curves based on the prediction have a different shape than the index curve. This reflects a more sophisticated strategy that generates dynamically changing signals.

[Figure 5.18: Too many positive signals result in a curve similar to the index.]
Chapter 6

Conclusion and Future Work

This chapter concludes the thesis. We show our findings through the experiments and indicate possible improvements to the tested model.

6.1 Conclusion

Our experiments can be sorted into three categories: testing of the HMGM with different training window sizes, testing of the weighted EM algorithms in comparison with the classic EM algorithm, and the use of a supporting time series to help predict the target series.

We introduce an important concept in the model design: the training window. The training window is used in both training and testing, as these two processes are combined together throughout the experiment. What the size of the training window should be is an empirical question. Experimental results show that a 40-day training window generates the best prediction, and the accuracy of prediction decreases as the size of the training window changes in either direction. This means that to get the best prediction, we have to balance between short-term sensitivity and long-term memory. In a 20-day experiment the cumulative rate of return is 20% above the actual market return over the 400 days of training and testing.

By using a mixture of Gaussian density functions at each state, the emission functions provide a broad range of predictions. Compared with the neural network experts, the parameters of the HMGM emission density functions can be clearly derived from the EM algorithm. Results show that there is no incorrect bias as appeared in the neural network experiments: the HMGM performs equally well in both bull and bear markets. Although at times when the financial time series is volatile it will make more mistakes, the HMGM can eventually catch up with the trend.

Noticing that recent data should be given more emphasis than older data from the past, we revise the general EM algorithm to an exponentially weighted EM algorithm. This algorithm is motivated by the exponential moving average technique used widely in financial engineering. The weighted EM algorithm shows an increased sensitivity to recent data. However, during highly sensitive periods, i.e., when buy and short signals reverse within a very short time, the model might fall into the pitfall of frequently changing whipsaws. To overcome this we add a second weight to the exponential weight. The second weight, which is calculated from the divergence of the S&P 500 and the Dow Jones Index, represents the market volatility. In a 20-day training window test within the same training data, the revised model generates more stable predictions and a higher rate of return. The end of the 400-day training and testing period generates a return of 195%.

6.2 Future Work

In this thesis, we discussed the application of the Hidden Markov Model to one single financial time series, more specifically, the S&P 500 Composite Index. Rather than the predefined 20-day training window, the sizes of the training pool and training window could be decided by a hill climbing or dynamic programming approach. Identifying useful side information for prediction, such as developing new indicators, will also help make better predictions.

The Gaussian distribution is the most popular distribution when modeling random or unknown processes, but it has been observed that many time series may not follow the Gaussian distribution. Since different financial time series have different characteristics, they will have different patterns. Which functions to use, and how functions are mixed at each state, are not studied in this thesis. The density can be designed to describe the probability of falling into a range of values; then we can extend the prediction to a form of density function rather than simple values. This way we can output richer information about the market movements, i.e., with how much probability the market will go up or down. The outputs of the HMGM models we present in the thesis are positive and negative symbols; a more delicate model will be able to output densities rather than symbols.

The double weights in the HMGM model can also be trained. These weights decide the sensitivity of the model, so by setting up a function that studies the relationship between sensitivity and accuracy, observing the price movements, we can expect to find better weights.

It will be interesting to apply the HMM technique to a portfolio of financial assets and study the time series generated by these groups of assets. We did not take the portfolio selection problem into consideration: a portfolio composed of different stocks will have different rates of return and risk levels. [Cov91] offers a nice introduction to the universal portfolio selection problem. The Viterbi algorithm could play an important role in this application: if we put functions that are estimated to best characterize different assets at different states, the path of the Viterbi algorithm will be able to tell what assets should be included in the portfolio. This can be valuable work to do.
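The core of the exponentially weighted EM idea, that recent observations receive geometrically larger weight so that parameter updates take the form of exponential moving averages, can be sketched as follows. The decay rate `eta` is a hypothetical choice for illustration, not a value fitted in the experiments.

```python
def exp_weighted_mean(xs, eta):
    """Weighted mean where observation t (most recent last) gets weight
    (1 - eta) ** (T - 1 - t): old data decays geometrically."""
    T = len(xs)
    weights = [(1 - eta) ** (T - 1 - t) for t in range(T)]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)

def ewma_update(estimate, new_obs, eta):
    """A related recursive form: an exponential moving average, the same
    update rule technical analysts use for smoothed price averages."""
    return (1 - eta) * estimate + eta * new_obs
```

The recursive form shows why the weighted update rules can reuse existing EMA machinery: each new statistic is blended into the running estimate with weight `eta` instead of being averaged uniformly over the whole history.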
Appendix A

Terminology List

outlier Outliers are a few observations that are not well fitted by the best available model. Usually, if there is no doubt about the accuracy or veracity of the observation, then it should be removed and the model should be refitted.

indicator An indicator is a value, usually derived from a stock's price or volume, that an investor can use to try to anticipate future price movements.

volatility Volatility is a measurement of change in price over a given period. It is usually expressed as a percentage and computed as the annualized standard deviation of the percentage change in daily price. The more volatile a stock or market, the more risky the stock is, and the more money an investor can gain (or lose) in a short time.

risk Same meaning as volatility.

Sharpe Ratio Also called the reward-to-risk ratio; equals the potential reward divided by the potential risk of a position.

trend Refers to the direction of prices. Rising peaks and troughs constitute an uptrend; falling peaks and troughs constitute a downtrend. A trading range is characterized by horizontal peaks and troughs.

whipsaw A whipsaw occurs when a buy or sell signal is reversed in a short time. Volatile markets and sensitive indicators can cause whipsaws.

Exponential Moving Average A moving average (MA) is an average of data for a certain number of time periods. It moves because for each calculation we use the latest x number of time periods' data. By definition, a moving average lags the market. An exponentially smoothed moving average (EMA) gives greater weight to the more recent data, in an attempt to reduce the lag.

short To short means to sell something that you don't have. To do this, you have to borrow the asset that you want to short. The reason to sell short is that you believe the price of that asset will go down.

profit Profit is the cumulative rate of return of a financial time series. To calculate today's cumulative return, multiply today's actual gain with yesterday's cumulative return.
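The two moving averages defined above can be computed as follows. The smoothing factor 2/(n+1) is the conventional EMA choice and is assumed here rather than specified in the text.

```python
def simple_moving_average(prices, n):
    """Average of the latest n prices; by definition it lags the market."""
    return [sum(prices[i - n + 1:i + 1]) / n
            for i in range(n - 1, len(prices))]

def exponential_moving_average(prices, n):
    """EMA: greater weight on recent data reduces the lag of the simple MA."""
    alpha = 2.0 / (n + 1)          # conventional smoothing factor (assumed)
    ema = [prices[0]]              # seed with the first price
    for p in prices[1:]:
        ema.append(alpha * p + (1 - alpha) * ema[-1])
    return ema
```

On a rising price series the EMA sits closer to the latest price than the simple MA does, which is exactly the reduced lag the definition describes.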
Bibliography

[ADB98] A. D. Back and A. S. Weigend. What Drives Stock Returns? An Independent Component Analysis. In Proceedings of the IEEE/IAFE/INFORMS 1998 Conference on Computational Intelligence for Financial Engineering, pages 141-156. IEEE, New York, 1998.

[APM02] A. Panuccio, M. Bicego, and V. Murino. A Hidden Markov Model-Based Approach to Sequential Data Clustering. In Structural, Syntactic, and Statistical Pattern Recognition, Joint IAPR International Workshops SSPR 2002 and SPR 2002, Windsor, Ontario, Canada, August 6-9, 2002, Proceedings, volume 2396 of Lecture Notes in Computer Science, pages 734-742. Springer, 2002.

[BF96] Y. Bengio and P. Frasconi. Input-Output HMM's for Sequence Processing. IEEE Transactions on Neural Networks, 7(5):1231-1249, 1996.

[Bil97] J. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. TR-97-021, International Computer Science Institute, UC Berkeley, 1997.

[Boy83] R. Boyles. On the Convergence Properties of the EM Algorithm. Journal of the Royal Statistical Society, Ser. B 44:47-50, 1983.

[BP66] L. E. Baum and T. Petrie. Statistical Inference for Probabilistic Functions of Finite State Markov Chains. Annals of Mathematical Statistics, 37:1559-1563, 1966.

[BS73] F. Black and M. Scholes. The Pricing of Options and Corporate Liabilities. Journal of Political Economy, 81:637, 1973.

[BWT85] W. De Bondt and R. Thaler. Does the market overreact? Journal of Finance, 40:793-805, 1985.

[CB04] S. Chiappa and S. Bengio. HMM and IOHMM modeling of EEG rhythms for asynchronous BCI systems. In ESANN, European Symposium on Artificial Neural Networks, 2004.

[Cha93] E. Charniak. Statistical Language Learning. MIT Press, Cambridge, Massachusetts, 1993.

[Chi95] G. Chierchia. Dynamics of Meaning: Anaphora, Presupposition and the Theory of Grammar. The University of Chicago Press, Chicago, 1995.

[Cov91] T. Cover. Universal Portfolios. Mathematical Finance, 1(1):1-29, 1991.

[CWL96] W. Cheng, L. Wagner, and C.-H. Lin. Forecasting the 30-year U.S. treasury bond with a system of neural networks. NeuroVe$t Journal, 1996.

[Deb94] G. Deboeck. Trading on the edge: neural, genetic, and fuzzy systems for chaotic financial markets. Wiley, New York, 1994.

[DLR77] A. Dempster, N. Laird, and D. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Ser. B 39:1-38, 1977.

[Dor96] G. Dorffner. Neural Networks for Time Series Processing. Neural Network World, 1996.

[Fam65] E. Fama. The Behaviour of Stock Market Prices. Journal of Business, 38:34-105, 1965.

[Fam70] E. Fama. Efficient Capital Markets: A Review of Theory and Empirical Work. Journal of Finance, 25:383-417, 1970.

[FST98] S. Fine, Y. Singer, and N. Tishby. The hierarchical Hidden Markov Model: analysis and applications. Machine Learning, 32, 1998.

[Fur01] S. Furui. Digital speech processing, synthesis, and recognition. Marcel Dekker, New York, 2001.

[GI00] I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series, and Products. Academic Press, San Diego, 2000.

[GLT01] C. Lee Giles, S. Lawrence, and A. C. Tsoi. Noisy Time Series Prediction using a Recurrent Neural Network and Grammatical Inference. Machine Learning, 44:161-183, 2001.

[GP86] C. W. J. Granger and P. Newbold. Forecasting Economic Time Series. Academic Press, San Diego, 1986.

[Gra86] B. Graham. The Intelligent Investor: The Classic Bestseller on Value Investing. HarperBusiness, New York, 1986.

[GS00] X. Ge and P. Smyth. Deformable Markov model templates for time-series pattern matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81-90. ACM Press, New York, 2000.

[HG94] J. Hughes and P. Guttorp. Incorporating spatial dependence and atmospheric data in a model of precipitation. Journal of Applied Meteorology, 33(12):1503-1515, 1994.

[HK01] J. Han and M. Kamber. Data mining: concepts and techniques. Morgan Kaufmann, San Francisco, 2001.

[HLP88] G. Hardy, J. Littlewood, and G. Polya. Inequalities. Cambridge University Press, Cambridge, 1988.

[Hul00] J. Hull. Options, Futures and Other Derivatives. Prentice-Hall, Upper Saddle River, NJ, 2000.

[Jel97] F. Jelinek. Statistical methods for speech recognition. MIT Press, Cambridge, Mass., 1997.

[JP98] T. Jebara and A. Pentland. Maximum conditional likelihood via bound maximization and the CEM algorithm. In Advances in Neural Information Processing Systems 10, 1998.

[MCG04] G. Giampieri, M. Davis, and M. Crowder. A Hidden Markov Model of Default Interaction. In Second International Conference on Credit Risk, 2004.

[Mit97] T. Mitchell. Machine Learning. McGraw Hill, 1997.

[MS99] C. Manning and H. Schutze. Chapter 9: Markov Models. In Foundations of Statistical Natural Language Processing, pages 317-379. MIT Press, Cambridge, 1999.

[Nic99] Hidden Markov Mixtures of Experts for Prediction of Switching Dynamics. In Proceedings of Neural Networks for Signal Processing IX: NNSP'99. IEEE, 1999.

[PBM94] P. Baldi, Y. Chauvin, T. Hunkapiller, and M. McClure. Hidden Markov Models of Biological Primary Sequence Information. In Proceedings of the National Academy of Sciences USA, volume 91(3), pages 1059-1063, 1994.

[Rab89] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, volume 77, pages 257-286, 1989.

[RIADC02] R. I. A. Davis, B. C. Lovell, and T. Caelli. Improved Estimation of Hidden Markov Model Parameters from Multiple Observation Sequences. In 16th International Conference on Pattern Recognition (ICPR'02), Quebec City, QC, Canada, volume 2, page 20168, 2002.

[RJ86] L. Rabiner and B. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, June:4-16, 1986.

[Ros03] S. Ross. An Introduction to Mathematical Finance. Cambridge University Press, Cambridge, 2003.

[Row00] S. Roweis. Constrained Hidden Markov Models. In Advances in Neural Information Processing Systems, 2000.

[RW84] R. Redner and H. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2), 1984.

[RY98] E. Ristad and P. Yianilos. Towards EM-style Algorithms for a posteriori Optimization of Normal Mixtures. In IEEE International Symposium on Information Theory, 1998.

[Sha49] C. Shannon. Communication Theory of Secrecy Systems. Bell System Technical Journal, 28(4):656-715, 1949.

[Sha66] W. Sharpe. Mutual Fund Performance. Journal of Business, 39:119-138, 1966.

[STW99] R. Sullivan, A. Timmermann, and H. White. Data-snooping, technical trading rule performance and the bootstrap. Journal of Finance, 54:1647-1691, 1999.

[SW97] S. Shi and A. Weigend. Taking Time Seriously: Hidden Markov Experts Applied to Financial Engineering. In Proceedings of the IEEE/IAFE 1997 Conference on Computational Intelligence for Financial Engineering, pages 244-252. IEEE, 1997.

[Tha87] R. Thaler. Anomalies: The January Effect. Journal of Economics Perspectives, 1987.

[Wac42] S. Wachtel. Certain Observations on Seasonal Movements in Stock Prices. Journal of Business, 15:184-193, 1942.

[WG93] A. Weigend and N. Gershenfeld. Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, 1993.

[WMS95] A. Weigend, M. Mangeas, and A. Srivastava. Nonlinear gated experts for time series: Discovering regimes and avoiding overfitting. International Journal of Neural Systems, volume 6, 1995.

[Wu83] C. F. J. Wu. On the Convergence Properties of the EM Algorithm. The Annals of Statistics, 11:95-103, 1983.

[XJ96] L. Xu and M. Jordan. On Convergence Properties of the EM Algorithm for Gaussian Mixtures. Neural Computation, 8(1):129-151, 1996.

[XLP00] X. Li, M. Parizeau, and R. Plamondon. Training Hidden Markov Models with Multiple Observations: A Combinatorial Method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4):371-377, 2000.

[ZG02] S. Zhong and J. Ghosh. HMMs and coupled HMMs for multi-channel EEG classification. In Proceedings of the IEEE International Joint Conference on Neural Networks, volume 2, pages 1254-1259. IEEE, 2002.