
Density Forecasting Using Hidden Markov Experts

Shanming Shi
J. P. Morgan & Co. Inc., 60 Wall Street, New York, NY 10260
Tel: 212-648-1760, Fax: 212-648-5030
shi_shanming@jpmorgan.com, www.cs.colorado.edu/~shanming

Andreas S. Weigend
Department of Information Systems, Leonard N. Stern School of Business
New York University, 44 West Fourth Street, MEC 9-74, New York, NY 10012, USA
aweigend@stern.nyu.edu, www.stern.nyu.edu/~aweigend

Abstract. We present a framework for predicting the conditional distributions of future observations that is well suited for skewed, fat-tailed, and multi-modal time series. This framework allows us to address questions about the nature of an observed time series, such as: Are there discrete sub-processes underlying the observed data? If so, do they exhibit a hidden Markov structure, or are they better described by using external variables? Are the sub-processes nonlinear? The answers to these questions are obtained by building predictive models on part of the available data, and evaluating these models on held-out data using several methods that capture both quantitative and qualitative aspects of the predicted densities. Specifically, we discuss the similarities and differences between two architectures, gated experts and hidden Markov experts. For the task of predicting the daily distributions of S&P500 returns, the hidden Markov assumption leads to better density forecasts than gated experts. Both architectures are contrasted to a simple superposition of forecasts. Applications of good density forecasts range from building trading models to computing risk measures that capture non-Gaussian tails.

Key Words: Hidden Markov Models, Neural Networks, Density Forecasting, Mixture Models, Computational Finance

INTRODUCTION
In computational intelligence, problems are often dichotomized into either supervised learning (regression and classification, where the desired outcome is known for the training data) or unsupervised learning (clustering and data mining, where no target is supplied but structure is to be discovered). This work combines supervised with unsupervised learning. While it focuses on time series prediction, particularly on forecasting the full distribution of a future value of a time series, the framework applies to any task that can be formulated as task-dependent clustering, such as stock picking or customer segmentation.

Previous work introduced a model class called mixture of experts (Jacobs, Jordan, Nowlan and Hinton 1991), gated experts (Weigend, Mangeas and Srivastava 1995), or society of experts (Rumelhart, Durbin, Golden and Chauvin 1996). These architectures consist of several so-called experts. The outputs of the experts are nonlinear functions of their inputs and can be interpreted as conditional means (supervised learning). There is also a gate that generates the probabilities of the individual experts in response to its inputs (unsupervised learning). The gated experts architecture represents a regression model. When used in forecasting, the temporal structure of the time series enters only through the construction of the input-output pairs, called patterns. Note that once these patterns have been generated from the raw data, randomizing the order of the training data has essentially no effect on the resulting model.

In the real world, there are time series problems where a regression approach is indeed the appropriate one. However, there are other time series problems where better models (defined by better out-of-sample forecasts in comparison to the regression case) can be obtained by taking time into account more directly. The standard connectionist approach to incorporating time history uses recurrent networks. However, recurrent networks have not seen much success on financial problems, where the small amount of signal contained in a typical training set seems to be insufficient to determine the structure and parameters of this unconstrained architecture. Some constraints need to be imposed.

We propose the model class of hidden Markov experts (Shi and Weigend 1997). This class strikes a balance between time-ignoring regression models and fully recurrent architectures. Hidden Markov experts do take time into account explicitly, yet avoid the difficulties of fully recurrent architectures by imposing stringent constraints on the way time enters.¹ The underlying assumptions are:

- There are several discrete states. Their corresponding functional input-output mappings can be expressed as feedforward networks. These "sub-models" are called experts.
- At each time step, one and only one expert is responsible for generating the corresponding observation. We do not know which of the experts actually generated the observation; the probabilities of the experts for each time step need to be estimated from the data.
- Modeling the sequence of the hidden states, we assume that the dynamics of the hidden states can be described by a first order Markov process, i.e., the next state depends only on the current state. This is expressed as a matrix of transition probabilities between the hidden states. We do not know these transition probabilities either; they also have to be estimated from the data.
Fortunately, the statistically solid framework of hidden Markov models (Baum and Eagon 1963) provides algorithms to estimate the unknown quantities. We combine this framework with connectionist techniques and show how we can learn the potentially nonlinear functions of each expert, in addition to the parameters of the transition matrix, as well as the probability vector across states at each time step. One quantity that cannot be determined from the data directly is the number of hidden states. We estimate and test models for different numbers of hidden states (e.g., from two to six).
¹ The key idea gated experts share with hidden Markov experts can be clarified in comparison to the standard idea of combining predictive models obtained on different information sets. These individual models usually weigh all their training points equally (Bates and Granger 1969, Granger 1989).

Context

Furthermore, we also need to compare hidden Markov experts with models that do not assume an underlying hidden Markov process but otherwise differ as little as possible. The natural comparison is between hidden Markov experts and gated experts. For both classes, we chose the same number of experts, with identical structure (expressed as feedforward neural networks with identical sets of inputs in the two cases), as well as identical noise models of independent Gaussian noise with adaptive expert-specific variances. The only degree of freedom left is the choice of the inputs into the gate. If a given data set has already been analyzed with hidden Markov experts, an interpretation of the hidden states can guide this choice. For example, if the variances associated with the experts span a wide range, a reasonable gate input might be a volatility measure such as exponentially smoothed squared log price differences.

The strict out-of-sample evaluation is carried out on test sets after the end of the training period. We report several measures:

- The likelihood of the test data given the model. This statistic is well suited for comparing the performance of different architectures, but since it consists of a single number for each model, it does not give constructive feedback towards an improvement.
- The probability integral transform approach suggested by Diebold, Gunther and Tay (1998). For each time step, our models predict the full density. Then the corresponding observation comes in. We compute the cumulative probability distribution from the predicted density, and record the value this cumulative takes at the observation. These recorded values should be uniformly distributed. We histogram the recorded values, and also present the correlograms of the time series of this quantity and its powers.
- The third measure ignores the density predictions (the strength of both hidden Markov experts and gated experts) and only evaluates the expected value at each time step. This enables the comparison with other methods that only give point predictions, by using normalized mean squared error or more robust measures.

A hidden Markov model is a parametric stochastic probability model with which a time series can be generated or analyzed. A hidden Markov model has two interrelated processes: a finite-state Markov chain that cannot be observed, and an output probabilistic function associated with each state. The Markov chain is defined by a state transition probability matrix, while the output probability functions, defined as the observation probabilities or densities (also called emission probabilities), may be represented non-parametrically or parametrically. The representation of the observation probabilities is called the emission model. From a generating point of view, the Markov chain generates a sequence of discrete states (called a path), and the emission model generates the time series based on the path. From an analysis point of view, an observed time series shows the evidence about the hidden path. Therefore, in a hidden Markov model the output probabilities impose a "veil" (Ferguson 1980) between the states and the observer of the time series. The task of Markov modeling is to lift the veil. A hidden Markov model is called "hidden" because these states cannot be seen directly from the observed data. We also assume that the hidden process is a Markov process: the probability of the next state depends on the current state and on the transition probability between the two states.
Both the states and the observed process can be either discrete or continuous. In speech recognition, the states and the observations are both discrete. In state space models, both states and observations are continuous (Harvey 1991, Timmer and Weigend 1997). The hidden Markov experts discussed here use discrete states, corresponding to the regimes, and continuous observations, corresponding to the time series. The model discussed in this paper is called hidden Markov experts instead of hidden Markov models because we focus on the linear and nonlinear emission models. The main problem in hidden Markov modeling is to estimate the parameters given the observed sequence. Baum and Eagon (1963) solved this problem for hidden Markov models with discrete observation densities. Baum, Petrie, Soules and Weiss (1970) extended the algorithm to many of the classical distributions. Hidden Markov models have been widely used in the field of speech recognition (Huang, Ariki and Jack 1990).

Related Work

The concept of transitions among states can also be used in modeling the time dependency of regime switching (Fraser and Dimitriadis 1994). Poritz (1982) first showed that linear prediction analysis can be combined with hidden Markov models. Hamilton (1990) introduced Markov switches in the context of a vector autoregression. Many applications have appeared in economic and financial analyses since then (Engel and Hamilton 1990, Lahiri and Wang 1994, Durland and McCurdy 1994). However, all these applications focused on point forecasting and the interpretation of the regimes. Hamilton and Susmel (1994) proposed an approach to model the conditional variances within the Markov switching framework, where they combined the regime switching process with an autoregressive conditional heteroskedasticity (ARCH) model by allowing the parameters of an ARCH process to come from different regimes. Gray (1996) proposed a more comprehensive method to nest a GARCH model within a regime switching model. However, these two models are limited to the first and second conditional moments of the distribution.

This paper focuses on the fact that Markov switching models are essentially mixtures. By using forward-backward algorithms, hidden Markov experts are computationally more efficient than Markov switching models. By introducing nonlinear experts, hidden Markov experts can be more flexible in modeling dynamics. Moreover, in this paper, we use the model for conditional density forecasting instead of point forecasting (Shi and Weigend 1997). Density forecasting is an essential tool in risk management. However, while point forecasting is more common in the forecasting literature, only a few studies have discussed interval forecasts (Chatfield 1993, Christoffersen 1997) and probability forecasts (Clemen and Winkler 1995, Murphy and Winkler 1992). Diebold et al. (1998) suggest several reasons for this neglect: the difficulty of the distributional assumption, the difficulty of evaluation, and the lack of demand from practice. Diebold et al. (1998) describe a simple way to directly evaluate density forecasts. Generally, analytic construction of density forecasts requires restrictive assumptions, but hidden Markov experts construct the density forecasts by assuming that the density is a mixture of Gaussians. Theoretically, however, any form of distribution can be used as the mixture components.

This paper is organized as follows: The next section explains the notation, describes the likelihood function of the model, and illustrates the Expectation Maximization (EM) algorithm used in hidden Markov experts. Then we explain how to generate density predictions using hidden Markov experts and describe the probability integral transform method for evaluating the density. After that, we present a detailed simulation example to illustrate our approach of using hidden Markov experts to construct density forecasts. Finally, we use our approach to predict and evaluate the density forecasts of daily US S&P 500 stock returns.

THE ASSUMPTIONS AND ALGORITHM

Notation


1. Observations: Y_T = \{y_t \mid t = 1, \ldots, T\}, where T is the number of observations and t is the time index. Y_T refers to the observed time series data. Similarly, X_T = \{x_t \mid t = 1, \ldots, T\} represents the inputs to the emission model. x_t itself can be a vector or a scalar. For instance, x_t can be specified as x_t = \{y_{t-1}, y_{t-2}, \ldots, y_{t-d}\}, where d is the dimension of the input. x_t may also include different exogenous variables and their lagged values other than y.

2. States: S = \{1, 2, \ldots, j, \ldots, M\}, where M is the number of states in the model and j refers to the j-th state. The number of states is usually an assumption of the model. However, it is not uncommon that we can find some evidence that each state corresponds to some physical significance or economic meaning, such as growth, recession, interest rate conditions, or volatility regimes.

3. Transition probabilities: The transition probability a_{ij} denotes the probability of switching from state i to state j: A = \{a_{ij}\}, 1 \le i, j \le M, with a_{ij} = P(s_{t+1} = j \mid s_t = i), where a_{ij} \ge 0, \sum_j a_{ij} = 1, and s_t denotes the state at time t.

4. Emission probabilities: The probability of observing y_t given the state. This probability may also depend on the inputs at time t: B = \{b_t(j)\}, 1 \le j \le M, 1 \le t \le T, with b_t(j) = P(y_t \mid s_t = j, x_t).

5. Initial probabilities of each state: \pi = \{\pi_i\}, i = 1, \ldots, M, with \sum_{i=1}^{M} \pi_i = 1.

For convenience, we use \lambda = \{A, B, \pi\} to denote the parameters of the model. Then the emission probability can be rewritten as P(y_t \mid s_t, x_t, \lambda).

The Likelihood Function

To define the likelihood function, we impose the following constraints. Discrete state transitions are first order Markovian and independent of prior observations:

P(s_t \mid s_{t-1}, s_{t-2}, \ldots, s_1, Y_{t-1}, X_{t-1}) = P(s_t \mid s_{t-1})    (1)

With q_T denoting a path of the states from t = 1 to T, we can write the probability of a specific path q_T as

P(q_T) = P(s_T, s_{T-1}, \ldots, s_t, \ldots, s_1) = P(s_1) \prod_{t=2}^{T} P(s_t \mid s_{t-1})    (2)

Given x_t and s_{t-1}, earlier values of s and y are irrelevant:

P(y_t, s_t \mid q_{t-1}, Y_{t-1}) = P(y_t, s_t \mid s_{t-1}, x_t)    (3)

which with Eq. (1) implies

P(y_t, s_t \mid s_{t-1}, x_t) = P(y_t \mid s_t, x_t) \, P(s_t \mid s_{t-1})    (4)

The central problem of HMMs is to find the parameters \lambda that most likely fit the observed data Y_T. Using Eq. (3) and Eq. (4), the likelihood P(Y_T \mid \lambda) is then given as

P(Y_T \mid \lambda) = \sum_{q_T} P(Y_T, q_T \mid \lambda)
  = \sum_{q_T} P(y_T, s_T \mid q_{T-1}, Y_{T-1}, \lambda) \, P(Y_{T-1}, q_{T-1} \mid \lambda)    (conditional probability)
  = \sum_{q_T} P(y_T, s_T \mid s_{T-1}, x_T, \lambda) \, P(Y_{T-1}, q_{T-1} \mid \lambda)    (using Eq. 3)
  = \sum_{q_T} P(y_T \mid s_T, x_T, \lambda) \, P(s_T \mid s_{T-1}) \, P(Y_{T-1}, q_{T-1} \mid \lambda)    (using Eq. 4)
  = \sum_{q_T} \underbrace{P(y_1 \mid s_1, x_1, \lambda)}_{b_1} \, \underbrace{P(s_1)}_{\text{initial state}} \, \prod_{t=2}^{T} \underbrace{P(y_t \mid s_t, x_t, \lambda)}_{b_t(j)} \, \underbrace{P(s_t \mid s_{t-1})}_{a_{ij}}    (5)

where P(y_t \mid s_t, x_t, \lambda) = b_t(j). Therefore, to obtain the probability P(Y_T \mid \lambda), two probabilities need to be estimated: the transition probability P(s_t \mid s_{t-1}) and the emission probability given the current state, P(y_t \mid s_t, x_t, \lambda).

Models for the Conditional Emission Probabilities: Experts


Independence: Given the input of the emission model, the likelihood of observing y_t given the current state and the current input is b_t(j) = P(y_t \mid s_t = j, x_t, \lambda). These likelihoods are independent for each t. We call each of the specified emission models an expert, and each individual expert corresponds to one state.

Density Function: We can assume different forms of distributions. For example, if we assume a Gaussian emission density, then the emission probability of the j-th expert becomes

b_t(j) = P(y_t \mid s_t = j, x_t, \lambda_j) = \frac{1}{\sqrt{2 \pi \sigma_j^2}} \exp\left( - \frac{(y_t - \hat{y}_t(j))^2}{2 \sigma_j^2} \right)

where \hat{y}_t(j) is the mean of the prediction.

Architecture: The experts can have any feedforward architecture. In the simple case of a first order linear autoregressive model, \hat{y}_t(j) is given by \hat{y}_t(j) = k_0 + k_1 x_t. We can use linear autoregressive models as well as nonlinear neural networks as experts. The emission probability B is determined by a set of parameters \theta_j according to the architecture of the emission model (such as k_0, k_1, and \sigma in the autoregressive example). Note that different experts can have different sets of inputs. Typically, the inputs to each expert are a subset of the full set of inputs of a global model. This turns out to be an important advantage that alleviates the effects of the curse of dimensionality.
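As a concrete illustration of such an emission model, the following sketch computes the Gaussian emission probabilities b_t(j) of a single first order linear autoregressive expert. It is written in Python/NumPy rather than the authors' Matlab code, and the function and variable names are our own.

```python
import numpy as np

def expert_emission_probs(y, x, k0, k1, sigma):
    """Gaussian emission probabilities b_t(j) of one linear AR expert.

    y      : (T,) observed targets y_t
    x      : (T,) inputs x_t (here a single lagged value, e.g. y_{t-1})
    k0, k1 : autoregressive coefficients of the expert
    sigma  : standard deviation of the expert's Gaussian noise model
    """
    y_hat = k0 + k1 * x                            # conditional mean of the expert
    return np.exp(-0.5 * ((y - y_hat) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)
```

Stacking these probabilities over all M experts yields a T x M matrix B that is the only model-specific input the forward-backward recursions below require.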

Computing the Likelihood: Forward-backward Procedure

Baum (1972) proposed an elegant algorithm, called the forward-backward procedure, to calculate P(Y \mid \lambda) instead of directly using Eq. (5). The algorithm is linear in T. Dempster, Laird and Rubin (1977) introduced the EM algorithm to maximize this probability. Here, we apply these algorithms to hidden Markov experts. Define the joint probability of the observations y from time 1 to time t and the state at time t, given model parameters \lambda, as \alpha_t(i) = P(y_1, y_2, \ldots, y_t, s_t = i \mid \lambda), where 1 \le t \le T. We obtain the joint probability of the entire sequence of observations as the sum over the states, P(Y \mid \lambda) = \sum_{i=1}^{M} \alpha_T(i). The \alpha's can be computed recursively, with initial probability \alpha_1(i) = \pi_i b_1(i):

\alpha_{t+1}(j) = \left[ \sum_{i=1}^{M} \alpha_t(i) \, a_{ij} \right] b_{t+1}(j)    (6)

This recursion is called the forward procedure. Given the initial estimates of \pi and b_1, we can compute the probability P(Y \mid \lambda), and therefore the likelihood. Similarly, we define the backward variable \beta_t(i) as the joint probability of the observations from t+1 to T, given the state at time t and the parameters:

\beta_t(i) = P(y_{t+1}, y_{t+2}, \ldots, y_T \mid s_t = i, \lambda)

With the recursive induction starting from \beta_T(i) = 1 for all i,

\beta_t(i) = \sum_{j=1}^{M} a_{ij} \, b_{t+1}(j) \, \beta_{t+1}(j)    (7)

where t = T-1, T-2, \ldots, 2, 1, we can compute all the \beta's for each t. The backward procedure helps exploit the entire observed sequence to estimate the probability P(s_t = j \mid \lambda). With \alpha and \beta, we can determine \gamma_t(i) = P(s_t = i \mid Y_T, \lambda), the posterior probability of a state at time t given the observations and parameters:

\gamma_t(i) = P(s_t = i \mid Y_T, \lambda) = \frac{P(Y_T, s_t = i \mid \lambda)}{\sum_{k=1}^{M} P(Y_T, s_t = k \mid \lambda)} = \frac{\alpha_t(i) \, \beta_t(i)}{\sum_{k=1}^{M} \alpha_t(k) \, \beta_t(k)}    (8)

The probability \gamma_t(i) can be used as the estimate of P(s_t = i \mid \lambda). Similarly, an auxiliary probability, the joint probability of conjunctive states, \xi_{t,t+1}(i,j) = P(s_t = i, s_{t+1} = j \mid Y_T, \lambda), can also be computed from \alpha and \beta as follows:

\xi_{t,t+1}(i,j) = \frac{P(s_t = i, s_{t+1} = j, Y_T \mid \lambda)}{P(Y_T \mid \lambda)} = \frac{\alpha_t(i) \, a_{ij} \, b_{t+1}(j) \, \beta_{t+1}(j)}{\sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_t(i) \, a_{ij} \, b_{t+1}(j) \, \beta_{t+1}(j)}    (9)
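The recursions of Eqs. (6)-(9) translate directly into a few lines of array code. The sketch below is our own Python/NumPy illustration, not the authors' implementation; for clarity it omits the per-time-step rescaling of alpha and beta that a practical implementation needs to avoid numerical underflow on long series.

```python
import numpy as np

def forward_backward(B, A, pi):
    """E-step quantities for hidden Markov experts.

    B  : (T, M) emission probabilities b_t(j) = P(y_t | s_t = j, x_t)
    A  : (M, M) transition matrix, A[i, j] = P(s_{t+1} = j | s_t = i)
    pi : (M,)   initial state probabilities
    Returns gamma (T, M), xi (T-1, M, M), and the likelihood P(Y | lambda).
    """
    T, M = B.shape
    alpha = np.zeros((T, M))
    beta = np.ones((T, M))

    alpha[0] = pi * B[0]                            # initialization of Eq. (6)
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[t + 1]    # forward recursion, Eq. (6)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])      # backward recursion, Eq. (7)

    likelihood = alpha[-1].sum()                    # P(Y | lambda) = sum_i alpha_T(i)

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)       # posterior state probabilities, Eq. (8)

    xi = alpha[:-1, :, None] * A[None] * (B[1:] * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)        # joint posteriors, Eq. (9)
    return gamma, xi, likelihood
```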

Baum-Welch Algorithm: EM Algorithm for HMMs


Equation (5) cannot be maximized directly since the states are hidden. Baum et al. (1970) presented a solution for this problem, which is rooted in the development of the EM algorithm. For the parameters \lambda and \lambda_{old}, they defined the auxiliary Q-function

Q(\lambda, \lambda_{old}) = \sum_{q_T} P(Y_T, q_T \mid \lambda_{old}) \log P(Y_T, q_T \mid \lambda)    (10)

They have shown that Q(\lambda, \lambda_{old}) \ge Q(\lambda_{old}, \lambda_{old}) implies P(Y_T \mid \lambda) \ge P(Y_T \mid \lambda_{old}) (Liporace 1982). The re-estimation is called the Baum-Welch algorithm.

Expectation Step:

In the expectation step, for each t, the probabilities \alpha and \beta, and in turn the posterior probabilities \gamma and \xi, are calculated based on the current estimate of \lambda according to Eqs. (8) and (9), respectively. The posterior \gamma is the central quantity in this calculation.

Maximization Step:

The papers of Baum et al. (1970), Juang (1984), and Liporace (1982) proved the convergence of the algorithm. Maximizing Eq. (10) under the constraints \sum_{i=1}^{M} \pi_i = 1 and \sum_{j=1}^{M} a_{ij} = 1 yields the re-estimation formulas:

1. Initial probabilities: \pi_i = \gamma_1(i)

2. Transition probabilities:

a_{ij} = \frac{\text{expected number of transitions from state } i \text{ to } j}{\text{expected number of transitions from state } i \text{ to anywhere}} = \frac{\sum_t \xi_{t,t+1}(i,j)}{\sum_t \gamma_t(i)}

3. Emission parameters: In the original work, Baum et al. (1970) only estimated the unconditional density of the observations. In this research, parametric emission models are assumed. For each individual emission model, maximizing Eq. (10) is the same as maximizing the following (Fraser and Dimitriadis 1994):

G = \sum_{t=1}^{T} \sum_{j=1}^{M} \gamma_t(j) \log P(y_t \mid x_t, s_t = j, \lambda_j)    (11)

where \lambda_j represents the parameters of the emission model of state j. Equation (11) can be viewed as a cost function for the emission model. The computation of the parameters depends on the emission model. We assume the errors to be Gaussian distributed and use neural networks as experts. Therefore, we have two sets of parameters: the parameters of the linear model or the weights of the neural networks, \theta_j, and the variances of the Gaussian noise model, \sigma_j^2. From Eq. (11),

\frac{\partial G}{\partial \theta_j} = \sum_{t=1}^{T} \frac{\gamma_t(j)}{P(y_t \mid x_t, s_t = j, \lambda_j)} \frac{\partial P(y_t \mid x_t, s_t = j, \lambda_j)}{\partial \theta_j} = \sum_{t=1}^{T} \frac{\gamma_t(j)}{\sigma_j^2} \left( y_t - \hat{y}_t(j) \right) \frac{\partial \hat{y}_t(j)}{\partial \theta_j}    (12)

where \hat{y}_t(j) is the prediction from the j-th expert; and for the variance,

\frac{\partial G}{\partial \sigma_j^2} = \sum_{t=1}^{T} \frac{\gamma_t(j)}{P(y_t \mid x_t, s_t = j, \lambda_j)} \frac{\partial P(y_t \mid x_t, s_t = j, \lambda_j)}{\partial \sigma_j^2}

yielding

\sigma_j^2 = \frac{\sum_{t=1}^{T} \gamma_t(j) \left( y_t - \hat{y}_t(j) \right)^2}{\sum_{t=1}^{T} \gamma_t(j)}    (13)

This equation can easily be extended to the multivariate case. For linear emission models, maximizing Eq. (11) is identical to minimizing

\sum_{t=1}^{T} \gamma_t(j) \left( y_t - \hat{y}_t(j) \right)^2

where \hat{y}_t(j) = k_0 + k x_t, and k and x_t can be vectors or scalars. That is the sum-squared error weighted by \gamma_t(j), the effective learning rate.
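For linear experts, the maximization step therefore reduces to a \gamma-weighted least squares problem plus the closed-form re-estimates of \pi, A, and \sigma_j^2. The following sketch shows one possible implementation under these assumptions (Python/NumPy, names ours); nonlinear neural network experts would instead take gradient steps on Eq. (12).

```python
import numpy as np

def m_step(y, X, gamma, xi):
    """Re-estimation formulas for hidden Markov experts with linear experts.

    y     : (T,)        targets
    X     : (T, d)      expert inputs (an intercept column is added below)
    gamma : (T, M)      posteriors from the E-step, Eq. (8)
    xi    : (T-1, M, M) joint posteriors, Eq. (9)
    """
    T, M = gamma.shape
    pi = gamma[0]                                           # initial probabilities
    A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]    # transition re-estimation

    Xb = np.column_stack([np.ones(T), X])                   # add intercept k0
    K = np.zeros((M, Xb.shape[1]))
    sigma2 = np.zeros(M)
    for j in range(M):
        w = gamma[:, j]                                     # effective learning rates
        # gamma-weighted least squares: minimizes sum_t gamma_t(j) (y_t - yhat_t(j))^2
        W = Xb * w[:, None]
        K[j] = np.linalg.solve(Xb.T @ W, W.T @ y)
        resid = y - Xb @ K[j]
        sigma2[j] = np.sum(w * resid ** 2) / np.sum(w)      # Eq. (13)
    return pi, A, K, sigma2
```

Alternating this step with the forward-backward E-step above implements the Baum-Welch iteration for this model class.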

DENSITY PREDICTIONS AND THEIR EVALUATION


Generating the Forecasts for the Densities

After the discussion of the basic framework of the model, we now turn to the key question: how to make predictions with hidden Markov experts. For prediction, we cannot use Eq. (8) to estimate the state, because it is estimated from the entire observation sequence, which includes future information. However, given the sequence of observations through time t, we can estimate the predictive probability of a state in terms of the transition probabilities a_{ij} and the joint probability of state s_{t+1} = j at time t+1 and the observations through time t:

P(s_{t+1} = j \mid Y_t, \lambda) = \frac{P(Y_t, s_{t+1} = j \mid \lambda)}{P(Y_t \mid \lambda)} = \frac{\sum_{i=1}^{M} \alpha_t(i) \, a_{ij}}{\sum_{j=1}^{M} \sum_{i=1}^{M} \alpha_t(i) \, a_{ij}} \equiv \gamma^F_{t+1}(j)

For convenience, P(s_{t+1} = j \mid Y_t, \lambda) is denoted \gamma^F_{t+1}(j). Due to the linear superposition of the experts, the overall expected value is the weighted linear superposition of the individual expected values. The expectation for y at time t+1 is

\hat{y}_{t+1} = \sum_{j=1}^{M} \gamma^F_{t+1}(j) \, \hat{y}_{t+1}(j)    (14)

The density of y_{t+1} is given by

P(y_{t+1} \mid Y_t, \lambda) = \sum_{j=1}^{M} P(y_{t+1} \mid x_{t+1}, s_{t+1} = j, \lambda_j) \, P(s_{t+1} = j \mid Y_t, \lambda) = \sum_{j=1}^{M} \gamma^F_{t+1}(j) \, P(y_{t+1} \mid x_{t+1}, s_{t+1} = j, \lambda_j)    (15)

If we assume Gaussian distributions for the individual noise models, P(y_{t+1} \mid x_{t+1}, s_{t+1} = j, \lambda_j) is determined by two sets of parameters, the mean \hat{y}_{t+1}(j) and the variance \sigma_j^2. Therefore, we can estimate the full distribution of y_{t+1}.
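A minimal sketch of the one-step-ahead forecast of Eqs. (14)-(15), assuming Gaussian experts; the forward variables alpha_t come from the forward procedure run on the data observed so far, and the function and variable names are our own.

```python
import numpy as np

def predict_density(alpha_t, A, mu_next, sigma2):
    """One-step-ahead mixture forecast of hidden Markov experts.

    alpha_t : (M,)   forward variables alpha_t(i) at the current time t
    A       : (M, M) transition matrix
    mu_next : (M,)   expert means yhat_{t+1}(j), evaluated at the input x_{t+1}
    sigma2  : (M,)   expert variances sigma_j^2
    Returns the mixture weights gamma^F_{t+1}(j), the point forecast (Eq. 14),
    and a function evaluating the predictive density of Eq. (15).
    """
    w = alpha_t @ A
    w = w / w.sum()                       # predictive state probabilities gamma^F_{t+1}
    y_hat = w @ mu_next                   # Eq. (14): expected value of y_{t+1}

    def density(y):
        comp = np.exp(-0.5 * (y - mu_next) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
        return comp @ w                   # Eq. (15): Gaussian mixture density
    return w, y_hat, density
```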

How to Evaluate the Density Forecasts

A straightforward method for evaluating different models is to compute the likelihood of out-of-sample data given the density predictions of the model. This allows the direct comparison between different model classes. An alternative that goes beyond a single number and can provide constructive feedback to improve the model was suggested by Diebold et al. (1998) and uses the cumulative probability. Let P(y_t \mid C_t) be the process generating a series of observations y_t given all relevant conditioning variables C_t. Let Y_T = \{y_t \mid t = 1, \ldots, T\} be the corresponding series of realizations. Let \hat{P}(y_t \mid x_t) be the 1-step-ahead density forecast of y_t, where x_t \subset C_t. The method of Diebold et al. (1998) is based on the relationship between the data generating process P(y_t \mid C_t) and the sequence of density forecasts \hat{P}(y_t \mid x_t). Let the variable Z_t be the probability integral transform of y_t with respect to the forecast density, i.e., the cumulative probability at y_t:

Z_t = \int_{-\infty}^{y_t} \hat{P}(u \mid x_t) \, du

Diebold et al. (1998) give the following proposition: Suppose a series Y_T is generated from \{P(y_t \mid C_t)\}_{t=1}^{T}. If a sequence of density forecasts \{\hat{P}(y_t \mid x_t)\}_{t=1}^{T} coincides with \{P(y_t \mid C_t)\}_{t=1}^{T}, we have

\{Z_t\}_{t=1}^{T} = \left\{ \int_{-\infty}^{y_t} \hat{P}(u \mid x_t) \, du \right\}_{t=1}^{T} \overset{iid}{\sim} U(0,1)    (16)

That is, the sequence of probability integral transforms of Y_T with respect to \{\hat{P}(y_t \mid x_t)\}_{t=1}^{T} is iid uniformly distributed. Therefore, given a sequence of density forecasts, we can construct the probability integral transforms \{Z_t\}_{t=1}^{T}. We can then test whether the distribution of Z is iid uniform with some standard test method, such as the Kolmogorov-Smirnov test. However, as pointed out by Diebold et al. (1998), all these test methods are actually joint tests of uniformity and iid; if the test is rejected, we do not know what actually caused the rejection. Such tests are not very useful in practical problems since they are not helpful in improving the model. Instead, a two-step approach is suggested. First, to evaluate unconditional uniformity, they recommend visual evaluation of the density estimates, such as a histogram. Second, to evaluate whether Z is iid, they suggest using the correlogram of (Z - \bar{Z}), where \bar{Z} is the mean of Z. To detect potentially more sophisticated forms of dependence in addition to linear dependence, they further advocate examining the correlograms of the powers of (Z - \bar{Z}).
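For Gaussian-mixture forecasts, the integral in Eq. (16) has a closed form: the mixture CDF is the weighted sum of the component normal CDFs. The sketch below (our own illustration; it assumes SciPy for the normal CDF, and all names are ours) computes the Z series and the ingredients of the diagnostics just described, the histogram and the correlograms of the powers of (Z - mean(Z)).

```python
import numpy as np
from scipy.stats import norm

def pit_series(y, weights, means, sigmas):
    """Probability integral transforms Z_t (Eq. 16) for Gaussian-mixture forecasts.

    y       : (T,)   realized observations
    weights : (T, M) predictive mixture weights gamma^F_t(j)
    means   : (T, M) expert means for each time step
    sigmas  : (M,)   expert standard deviations
    """
    # CDF of the Gaussian mixture evaluated at the realization y_t
    return np.sum(weights * norm.cdf((y[:, None] - means) / sigmas[None, :]), axis=1)

def autocorr(x, max_lag=200):
    """Sample autocorrelation of (x - mean(x)) up to max_lag, for the correlograms."""
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

# Diagnostics: histogram of Z (should be flat) and correlograms of (Z - mean(Z))^p
# z = pit_series(y, weights, means, sigmas)
# hist, _ = np.histogram(z, bins=10, range=(0.0, 1.0))
# acfs = {p: autocorr((z - z.mean()) ** p) for p in (1, 2, 3, 4)}
```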

EXAMPLE 1: COMPUTER GENERATED DATA


To illustrate the idea of the approach and demonstrate the results when the switching is known, we use a time series generated by a two-state hidden Markov process. We compare three model classes on these data: the naive method of using an unconditional Gaussian, gated experts, and hidden Markov experts.

Generation models
Data sets: 20,000 data points are generated from a two-state hidden Markov process. The first 10,000 points are used as the training set, the next 5,000 data points are employed as the validation set, and the remaining 5,000 data points are used as the test set. The hidden Markov data is a time series that switches between a trending and a mean-reverting process with iid Gaussian innovations according to a Markov process. The two processes are

y_{t+1} = 0.5 \, y_t + 0.8 \, \varepsilon_{t+1}    (if in state 1)
y_{t+1} = -0.3 \, y_t + 0.5 \, \eta_{t+1}    (if in state 2)

where \varepsilon_t and \eta_t are both iid N(0,1). The transition probability matrix for the two states is

A = \begin{pmatrix} 0.98 & 0.02 \\ 0.03 & 0.97 \end{pmatrix}
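The data-generating process of this example can be reproduced with a few lines of code. The sketch below is our own illustration (the random seed and function name are arbitrary), following the two autoregressions and the transition matrix given above.

```python
import numpy as np

def simulate_hme(T=20000, seed=0):
    """Simulate the two-state hidden Markov process used in Example 1."""
    rng = np.random.default_rng(seed)
    A = np.array([[0.98, 0.02],
                  [0.03, 0.97]])           # transition probabilities
    y = np.zeros(T)
    s = np.zeros(T, dtype=int)
    for t in range(1, T):
        s[t] = rng.choice(2, p=A[s[t - 1]])
        if s[t] == 0:                      # state 1: trending process
            y[t] = 0.5 * y[t - 1] + 0.8 * rng.standard_normal()
        else:                              # state 2: mean-reverting process
            y[t] = -0.3 * y[t - 1] + 0.5 * rng.standard_normal()
    return y, s

y, states = simulate_hme()
train, valid, test = y[:10000], y[10000:15000], y[15000:]
```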

Recognition models
Naive method: The parameters of the naive method are the unconditional mean and the unconditional variance. We estimate them from the training set and then use them on the test set.

Gated experts: The EM algorithm is used to estimate the parameters of the gated experts (Weigend et al. 1995). Two linear autoregressive models are used as the experts and one nonlinear neural network is used as the gate. The neural network has three hidden tanh units and two linear output units. One lagged value of the observation is used as the input for both the gate and the experts. In addition, the gate also includes one lagged value of the exponential moving average of the square-of-observation series as a second input, which can be viewed as a volatility estimate.

Hidden Markov experts: Two linear autoregressive models are used as the experts. One lagged value of the observation is used as the input to the experts.

Results

First, we use the naive model based on the false assumption that the process is iid N(\mu, \sigma^2). Figure 1 shows the histogram and the correlograms of the probability-integral-transformed Z series: (Z - \bar{Z}), (Z - \bar{Z})^2, (Z - \bar{Z})^3, and (Z - \bar{Z})^4. The dotted lines mark two standard deviations of the estimates. As expected, the histogram and the correlograms show non-uniformity and strong autocorrelations, since the model has no predictability. These results indicate the poor density forecasts of the naive model.

Figure 1: The histogram and correlogram of the probability integral transform on hidden Markov data with the unconditional Gaussian as the recognition model. The top panel shows the histogram of the probability integral transform Z series. The bottom four panels show the correlograms of the Z series and its powers. The two dotted lines are the two standard deviations. We can see that the unconditional estimate is not uniformly distributed and there are autocorrelations shown in the correlograms.

Then we use gated experts to forecast the time series. Figure 2 shows the estimation results of the gated experts. The top panel shows part of the test set of the simulated data, the middle panel shows the one-step-ahead predictions of the model on the same data set, and the bottom one shows the segmentation extrapolated from the model versus the true segmentation from which the data were generated. The segmentation is poorly estimated with this model. The normalized mean squared error,

E_{NMS} = \frac{\sum_t (y_t - \hat{y}_t)^2}{\sum_t (y_t - \bar{y})^2},

which compares the model to using the unconditional mean of the time series for all predictions, is 0.8858 on the test set. Figure 3 shows the evaluation results of the gated experts with the probability integral transform method. The results show clear uniformity and fewer autocorrelations. This indicates that the gated experts are better at predicting the density. The remaining correlations in (Z - \bar{Z}) and (Z - \bar{Z})^3 come from the poor segmentation and, consequently, from the poor regressions that arise.

Figure 4 shows the estimation results of the hidden Markov experts in comparison to Fig. 2. The top panel shows part of the test set of the simulated data, the middle panel shows the one-step-ahead predictions of the model on the same data set. The ENMS is 0.8255. The bottom panel shows the segmentation estimated from the model versus the true segmentation from which the data were generated. We can see that the model recovered the segmentation correctly. Similarly, Fig. 5 shows the evaluation results of the hidden Markov experts. The results reveal that the Z series is uniformly distributed and there is no significant autocorrelation left. From Fig. 5, we can see that the histogram and the correlograms are almost as perfect as expected, since we are forecasting with a correctly specified model.

Table 1 gives the statistics of the parameter estimates: the diagonal elements of the transition probability matrix A, the autoregression coefficients \kappa, and the standard deviations \sigma of the Gaussian noise. We can see that the hidden Markov experts found the correct parameters. Not surprisingly, the estimates of the gated experts are much worse than those of the hidden Markov experts, because the hidden Markov experts have the right model structure for the data and the gated experts do not.

Figure 2: Forecasts and segmentation of gated experts on simulated hidden Markov data. The top panel shows 1000 data points of the test set from the simulated observations of the hidden Markov process. The middle panel shows the corresponding predictions from gated experts. The bottom panel shows the regime found by one expert compared to the true segmentation on out-of-sample simulated data. The dot-dashed line shows the true value used when the data was generated; the solid line shows the estimated probability of the model. The sum of the regime shown and the regime of the other expert equals one.

Table 1: Summary of the experiments on the computer simulations. For the transition probabilities on the main diagonal a_ii, the autoregression coefficients \kappa_i, and the noise levels \sigma_i, we give the true values as well as the mean and standard error of their estimates.

Parameters        a_11    a_22    \kappa_1  \kappa_2  \sigma_1  \sigma_2
True value        0.98    0.97    0.5       -0.3      0.8       0.5
HMEs              0.976   0.969   0.507     -0.269    0.808     0.492
  standard error  0.013   0.015   0.012     0.017     0.008     0.005
Gated experts     -       -       0.466     -0.003    0.867     0.528
  standard error  -       -       0.016     0.01      0.009     0.005

Table 2 shows the mean and the standard deviation of the log-likelihood of the test set using the naive method, and the log-likelihood of the gated experts and the hidden Markov experts. It indicates that the likelihoods of the gated experts and the hidden Markov experts are significantly better than that of the naive model. We also did another experiment to test our method. We generated a data set from a Gaussian distribution N(0,1) and then used the three approaches, unconditional Gaussian, gated experts, and hidden Markov experts, to model the data. Since the data is generated from N(0,1), the unconditional Gaussian should be the correct model. All three, however, found the correct distribution when evaluated with the integral transform method. Two experts are used for the gated experts and the hidden Markov experts. The transition probability of the hidden Markov model is around 0.5, since there is no regime in the data.
Figure 3: The histogram and correlogram of the probability integral transform on hidden Markov data with gated experts as the recognition model. We can see that the Z series is uniformly distributed. The bottom four panels show the correlograms of the Z series and its powers. However, (Z - \bar{Z}) and (Z - \bar{Z})^3 have some correlation left.

Table 2: Log-likelihood of the experiments on the hidden Markov simulation data.

                     Naive method   GEs       HMEs
Log likelihood       -1.1811        -1.0736   -1.0361
Standard deviation   0.0174         -         -

EXAMPLE 2: S&P500 RETURNS


We applied hidden Markov experts to S&P500 data and compared them to the results of using gated experts. We also used the probability integral transform method to evaluate the density forecasts.

Description of the Data

The total of 21 years of relative returns y_t = \log(\text{price}_t) - \log(\text{price}_{t-1}) of daily S&P500 data is divided into three sets: the training set (10 years of data from 1/3/77 to 12/31/86), the crash set (the stock market crash period from 1/3/87 to 3/1/88), and the test set (about 10 years of data from 3/2/88 to 12/31/97).

Specification of the Model

Hidden Markov experts: To better estimate the density, the hidden Markov experts utilize four linear autoregressive models as the experts. Seven lagged values of the returns as well as seven lagged values of the exponential moving average of the squared returns are used as the inputs to the experts.

Gated experts: Similarly, four linear networks are used as the experts. One neural network with five tanh hidden units and four linear output units is used as the gate. Seven lagged values of the returns in addition to seven lagged values of the exponential moving average of the squared returns are used as the inputs for the gate. Seven lagged values of the returns are used as the inputs to the experts.
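As an illustration of the input construction described above, the following sketch builds the full input set used by the hidden Markov experts (seven lagged returns plus seven lags of an exponential moving average of squared returns). The EMA decay constant is our assumption; the paper does not report the smoothing parameter.

```python
import numpy as np

def build_inputs(returns, n_lags=7, ema_decay=0.94):
    """Input matrix: n_lags lagged returns and n_lags lagged EMAs of squared returns.

    returns   : (T,) array of daily returns
    ema_decay : decay constant of the EMA; an assumption, not reported in the paper
    """
    T = len(returns)
    ema = np.zeros(T)
    for t in range(1, T):                  # EMA of squared returns as a volatility proxy
        ema[t] = ema_decay * ema[t - 1] + (1 - ema_decay) * returns[t] ** 2

    rows = []
    for t in range(n_lags, T):
        rows.append(np.concatenate([returns[t - n_lags:t][::-1],   # y_{t-1}, ..., y_{t-7}
                                    ema[t - n_lags:t][::-1]]))     # lagged volatility proxies
    X = np.array(rows)
    y = returns[n_lags:]                   # targets aligned with the input rows
    return X, y
```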

Estimated Parameters

Using the hidden Markov experts, the estimated transition probability matrix is given below. The order of states 1 to 4 is arranged according to decreasing variance of each state.

Figure 4: Forecasts and segmentation of hidden Markov experts on simulated hidden Markov data. The regime found by hidden Markov experts is clear and close to the true segmentation.

A = \begin{pmatrix}
0.9173 & 0.0145 & 0.0605 & 0.0077 \\
0.0010 & 0.9673 & 0.0315 & 0.0002 \\
0.0165 & 0.0016 & 0.9799 & 0.0020 \\
0.0058 & 0.0005 & 0.0003 & 0.9934
\end{pmatrix}

The standard deviation \sigma_j of each expert is estimated in Table 3.

Table 3: Standard deviation of each expert for hidden Markov experts and gated experts.

                  Expert 1   Expert 2   Expert 3   Expert 4
HMEs              1.366      1.004      0.767      0.610
  standard error  0.057      0.051      0.016      0.015
GEs               1.532      1.032      0.665      0.637
  standard error  0.076      0.033      0.018      0.012

The values in the transition matrix indicate that the states captured by each expert are very stable, since the diagonal of the transition matrix is close to one. However, the ENMS of the hidden Markov experts on the test set is 1.0199 and the ENMS of the gated experts on the test set is 1.0391. This means that if we only measure the prediction errors of the mean, the predictions from both models are no better than using the unconditional mean.

Evaluating the Density Forecasts

From the ENMS, it seems that there is no predictability in the S&P 500 data set. Figure 6 shows the histogram and the correlogram of the probability integral transform Z series on the test set using the unconditional Gaussian. We can see, however, that the histogram of the Z series is not uniform and correlation remains. Figures 7 and 8 demonstrate the segmentation results and the evaluation results of using gated experts on the S&P500 data. We can see that there is no clear regime found in any period of the time series. We can also see that the histogram of the probability integral transform of the test set is getting close to a uniform distribution, but there are still autocorrelations remaining in the Z series.

Figure 5: The histogram and correlogram of the probability integral transform on hidden Markov data with hidden Markov experts as the recognition model. We can see that the Z series is uniformly distributed and there is no autocorrelation remaining.

Figures 9 and 10 exhibit the prediction and evaluation results of hidden Markov experts on S&P 500 data. There are clear regimes shown in Fig. 9. The histogram shows that the Z series is more uniformly distributed. The correlograms reveal no significant correlations in the Z series. Figure 11 shows the distributions of the log-likelihood of the gated experts and the hidden Markov experts on this data set over 200 different runs, each starting from different initial points (the density in the graph is estimated with a Parzen window of width 0.01). While both distributions are distinguishably better than the likelihood of the unconditional Gaussian, the log-likelihood distribution of the hidden Markov experts has less variance than that of the gated experts.

CONCLUSIONS

This paper presented the theory of hidden Markov experts. We then explained the algorithm to predict densities using hidden Markov experts. We also discussed using the probability integral transform method to evaluate the density forecasts. In contrast to the earlier work on Markov switching models, we focused on the full conditional density rather than only on the conditional mean and conditional variance. A simulated time series was used to demonstrate our approach to density forecasting. From this experiment, we can see that hidden Markov experts can correctly identify the parameters and can predict the density correctly under the criteria of the probability integral transform method. We then applied the approach to the S&P 500 data. In this experiment, both the gated experts and the hidden Markov experts show no predictability in determining the conditional mean. However, it is important to note that both models can still be useful for predicting the density. For this data set, the hidden Markov experts are better models than the others. In comparison with traditional models, hidden Markov experts partition the regimes in time, whereas the gated experts or traditional mixture models partition the regions in input space. With the proposed method, we obtain the full density rather than only a mean and a variance. This paper assumes that the component distributions are Gaussian. The generalization to other distributions is straightforward.

Figure 6: The histogram and correlogram of the probability integral transform on S&P 500 data with the unconditional Gaussian as the recognition model. We can see that the Z series is not uniformly distributed. There are correlations remaining in the Z series.

The focus of this paper was to explain and illustrate the algorithm. We did not optimize the choice of inputs but simply used lagged values of the observations, clearly leaving room for improvement. In this study, we only used univariate outputs. The algorithm can easily be extended to multivariate cases. Applications of good density predictions in financial engineering span a wide spectrum that includes building trading models, constructing matching derivatives, and computing risk measures that capture non-Gaussian tails. In summary, this paper pushes the parallels between gated experts and hidden Markov experts. The two model classes only differ in the information used to compute the weights of the components of the predicted density, by either using a gate with external inputs to find regions, or by assuming an underlying Markov process for the regimes. From a scientific perspective, this paper can be viewed as a framework to investigate which real-world time series can be modeled with an underlying hidden Markov process (such as the example of daily S&P500 returns given here), and which cannot. We hope that this will lead to further insights and understanding. Software for both gated experts and hidden Markov experts, implemented in Matlab by Shanming Shi, is available at http://www.stern.nyu.edu/~aweigend/Research/Software. Any feedback is appreciated.

References

Bates, J. M. and Granger, C. W. J. 1969. "The combination of forecasts", Operations Research Quarterly 20: 451-468.

Baum, L. E. 1972. "An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes", Inequalities 3: 1-8.

Baum, L. E. and Eagon, J. A. 1963. "An inequality with applications to statistical prediction for functions of Markov processes and to a model for ecology", Bull. Amer. Math. Soc. 73: 360-363.

Baum, L. E., Petrie, T., Soules, G. and Weiss, N. 1970. "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains", Annals of Mathematical Statistics 41: 164-171.

Chatfield, C. 1993. "Calculating interval forecasts", Journal of Business and Economic Statistics 11: 121-135.

Christoffersen, P. 1997. "Evaluating interval forecasts", International Economic Review, forthcoming.

Clemen, R. T., Murphy, A. H. and Winkler, R. L. 1995. "Screening probability forecasts: Contrasts between choosing and combining", International Journal of Forecasting 11: 133-146.

Dempster, A. P., Laird, N. M. and Rubin, D. B. 1977. "Maximum likelihood from incomplete data via the EM algorithm", Journal of the Royal Statistical Society B 39: 1-38.

Diebold, F. X., Gunther, T. A. and Tay, A. S. 1998. "Evaluating density forecasts", Review of Economics and Statistics, forthcoming.

Durland, J. M. and McCurdy, T. H. 1994. "Duration-dependent transitions in a Markov model of U.S. GNP growth", Journal of Business and Economic Statistics 12: 279-288.

Engel, C. and Hamilton, J. D. 1990. "Long swings in the dollar: Are they in the data and do markets know it?", The American Economic Review 80: 689-713.

Ferguson, J. D. 1980. "Hidden Markov analysis: an introduction", in J. D. Ferguson (ed.), The Symposium on the Applications of Hidden Markov Models to Text and Speech, Princeton, NJ, pp. 143-179.

Fraser, A. M. and Dimitriadis, A. 1994. "Forecasting probability densities by using hidden Markov models", in A. S. Weigend and N. A. Gershenfeld (eds), Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley, Reading, MA, pp. 265-282.

Granger, C. W. J. 1989. "Combining forecasts: Twenty years later", Journal of Forecasting 8: 167-173.

Gray, S. F. 1996. "Modeling the conditional distribution of interest rates as a regime-switching process", Journal of Financial Economics 42: 27-62.

Hamilton, J. D. 1990. "Analysis of time series subject to changes in regime", Journal of Econometrics 45: 39-70.

Hamilton, J. D. and Susmel, R. 1994. "Autoregressive conditional heteroskedasticity and changes in regime", Journal of Econometrics 64: 307-333.

Harvey, A. C. 1991. Forecasting, Structural Time Series Models and the Kalman Filter, Cambridge University Press, Cambridge, U.K.

Huang, X. D., Ariki, Y. and Jack, M. A. 1990. Hidden Markov Models for Speech Recognition, Edinburgh University Press, Edinburgh.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J. and Hinton, G. E. 1991. "Adaptive mixtures of local experts", Neural Computation 3: 79-87.

Juang, B. H. 1984. "On the hidden Markov model and dynamic time warping for speech recognition - a unified view", AT&T Bell Laboratories Technical Journal 63: 1213-1243.

Lahiri, K. and Wang, J. G. 1994. "Predicting cyclical turning points with leading index in a Markov switching model", Journal of Forecasting 13: 245-263.

Liporace, L. A. 1982. "Maximum likelihood estimation for multivariate observations of Markov sources", IEEE Transactions on Information Theory IT-28: 729-734.

Murphy, A. H. and Winkler, R. L. 1992. "Diagnostic verification of probability forecasts", International Journal of Forecasting 7: 435-455.

Poritz, A. B. 1982. "Linear predictive hidden Markov models and the speech signal", Proc. ICASSP'82, Paris, France, pp. 1291-1294.

Rumelhart, D. E., Durbin, R., Golden, R. and Chauvin, Y. 1996. "Backpropagation: The basic theory", in P. Smolensky, M. C. Mozer and D. E. Rumelhart (eds), Mathematical Perspectives on Neural Networks, Lawrence Erlbaum Associates, Hillsdale, NJ, pp. 533-566.

Shi, S. and Weigend, A. S. 1997. "Taking time seriously: Hidden Markov experts applied to financial engineering", Proceedings of the 1997 IEEE/IAFE Conference on Computational Intelligence for Financial Engineering (CIFEr'97), Piscataway, NJ: IEEE Service Center, pp. 244-252.

Timmer, J. and Weigend, A. S. 1997. "Modeling volatility using state space models", International Journal of Neural Systems.

Weigend, A. S., Mangeas, M. and Srivastava, A. N. 1995. "Nonlinear gated experts for time series: Discovering regimes and avoiding overfitting", International Journal of Neural Systems 6: 373-399.



Figure 7: Segmentation of gated experts on S&P500 data. From top to bottom, the first panel shows the returns of the whole period. The second panel shows the predictions of the gated experts. The next four panels display the segmentations of each expert, in the order of high variance experts to low variance experts.



Figure 8: The histogram and correlogram of the probability integral transform on S&P 500 data with gated experts as the recognition model. The Z series is almost uniformly distributed. However, there are correlations remaining in the Z series.



Figure 9: Segmentation of hidden Markov experts on S&P500 data. From top to bottom, the first panel shows the returns of the whole period. The second panel shows the predictions of the hidden Markov experts. The next four panels display the segmentations of each expert, in the order of high variance experts to low variance experts. Due to the model assumption, the segmentation of the hidden Markov experts is much clearer than that of the gated experts.



Figure 10: The histogram and correlogram of the probability integral transform on S&P 500 data with the hidden Markov experts as the recognition model. We can see that the Z series is close to being uniformly distributed and there are no significant correlations remaining in the Z series.

Figure 11: Distributions of the log-likelihood per observation (log likelihood / N) of hidden Markov experts and gated experts over 200 runs. The medians of the two distributions and the log-likelihood of the unconditional Gaussian are marked.

