\documentclass[]{article}
\usepackage{amf}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage{psfrag}
% derived from ~alexis/work/text/afad.ltx
\newcommand{\field}[1]{\mathbb{#1}}
\newcommand{\REAL}{\field{R}}
\newcommand{\INTEGER}{\field{Z}}
\def\F{{\cal F}}
\def\x{{\bf x}}
\def\X{{\bf X}}
\def\y{{\bf y}}
\def\z{{\bf z}}
\def\Z{{\bf Z}}
\def\ts{{\tilde s}}
\def\a{{\bf a}}
\def\s{{\bf s}}
\def\ybar{\bar y}
\def\thetahat{{\bf \hat \theta}}
\def\btheta{{\bf \theta}}
\def\bra{\left<}
\def\ket{\right>_{\s |\y, \btheta}}
\def\qket{\right>_{q |y_1^T, \btheta}}
\def\argeq{{\scriptscriptstyle ^=}}
\def\argplus{\!{+}\!}
\def\argminus{\!{-}\!}

\begin{document}
\begin{center}
  \Large Hidden Markov Models with Mixed States\footnote{This is
    derived from the revised version that we submitted on April 17,
    1993.  The software that we used to insert figures into that
    version is not on our current system.  In September of 2001, I
    (Andy Fraser) edited the document to include the figures so that I
    could make whole ps and pdf files for
    distribution.}   \\
  \large
  Andrew M. Fraser and Alexis Dimitriadis \\[10pt]
  \normalsize
  Systems Science Ph.D. Program \\
  Portland State University\\
  Portland, Oregon 97207-0751\\{\em andy@sysc.pdx.edu}
\end{center}


\begin{abstract}
We note similarities of the state space reconstruction (``Embedology'')
practiced in numerical work on
chaos, state space methods of stochastic systems theory, and the hidden
Markov models (HMMs) used in speech research.  We review Baum's EM
algorithm in general and the specific forward-backward algorithm that
optimizes a class of HMM that has a mixed state space consisting of
continuous and discrete parts.  We then describe forecasts based on models
fit to data set $D$.
\end{abstract}

\section{Introduction}

In the first part of this paper we hope to provide an intuitive
explanation of hidden Markov model (HMM) methods that builds on the
notion of state space
reconstruction.  Later, we provide enough details about the approach
to enable a careful reader to develop new variants and to write his
own programs.  We begin with some thoughts on forecasting and state
spaces.
We were drawn to work on varieties of hidden Markov models for scalar
time series from chaotic dynamics, by
the similarity of the notion of {\em reconstructed state space}
in the chaos literature to the notion of {\em hidden state} in
the HMM literature.  The approach provides forecasts that consist of
probability densities instead of single guesses of future values.
In section \ref{sec:forecast}, such a forecast of data set $D$
suggests that the approach is quite powerful.

The most direct method of forecasting is to search the past for times
when conditions matched the patterns of recent observations and then
guess that what happened before will happen again.  While forecasts
for discrete valued periodic sequences like $(0,1,2,3,0,1,2,3,0,1,\ldots)$
are trivial, forecasting sequences like data set $D$,
$(0.643,0.558,0.484,0.434,0.422,\ldots)$, is difficult because there
is no segment in the recorded history that exactly matches recent
observations.  In such circumstances one may proceed by {\em guessing}
how {\em close} conditions at various times in the past are to present
conditions and then appropriately averaging near matches to make
forecasts; here {\em closeness} corresponds to distance in {\em state
space}, and conditional probability density functions for location in
state space given observations implement the {\em guesses}.

The notion of state space has also been essential in the efforts
over the past dozen years in which researchers have claimed
that various experimental time series arise from chaotic dynamics.
Characteristically scalar time series are converted to vector time
series in a procedure called {\em state space reconstruction}.  The
simplest procedure is the use of {\em delay vectors}.  Using a
notation in which a sequence of scalar observations is denoted by
$y_1^T \equiv \left(y(1),y(2),\ldots,y(T)\right)$, we write delay
vectors as $\x(t) \equiv y_{t-m}^{t-1}$, and observe that $\x_{m+1}^T$
can be obtained from $y_1^T$.  Having reconstructed a vector time
series, investigators generally argue that the dynamics are
deterministic, i.e., $\exists~\F:\REAL^m \rightarrow \REAL^m$ such that
$\x(t+1) = \F\left(\x(t)\right)~\forall t$, and then estimate
invariants such as dimensions, entropies, or Lyapunov exponents from
experimental measurements.

While the practice of explaining scalar time series in terms of vector
dynamics and using observed time series segments to specify locations
in state space has a long history\footnote
    {Most control theory text books have a section titled {\em Observability}
    which addresses the question of whether or not sequences of observations
    uniquely determine locations in state space.},
the more recent literature in chaos usually cites Packard
{\em et al.}\cite{Packard80} or Takens\cite{Takens81}.  The idea is
that an ``original'' state space variable $\z$ evolves via a
diffeomorphism\footnote
   {A one to one differentiable function with a one to one
   differentiable inverse.}
$F$ of some low dimensional manifold $\Z$ and gives rise to a scalar
observable $y\in\REAL$
%
\begin{subequations}
\label{eq:ssp}
  \begin{align}
    \z(t+1) & =  F\left( \z(t) \right) \\
    y(t)   & =  g\left( \z(t) \right),
  \end{align}
\end{subequations}
%
and a reconstruction function $\phi:\REAL^w\rightarrow \REAL^m$ is
used to map windows of observations $y_{t-w}^{t-1}$ of size $w$ to
reconstructed vectors $\x(t)\in\X = \REAL^m$.  Combining
$\phi,F^{-1},$ and $g$ one can write $\Phi:\Z\rightarrow\X$ with
\[
\x(t) = \Phi \left( \z(t) \right) = \phi \left( g(F^{-1}(z(t))),
g(F^{-2}(z(t))),\ldots,
g(F^{-w}(z(t))) \right).
\]
Takens showed that if $g$ and $\phi$ are differentiable and $m$ is large
enough, then it is a generic property that $\Phi$ is
a diffeomorphism, and thus one expects coordinate invariant properties
of trajectories and limit sets in $\Z$ to be the same as the
properties of their images in $\X$.  Although Takens' result is
insensitive to the details of $\phi$ and $g$, when experimenters
implemented the procedure, they found that their estimates of invariants
varied with changes in $\phi$ and $g$.

This variability of invariants lead to a literature (recently called
{\em Embedology}\cite{Sauer91}) concerned with defining and finding
{\em good reconstructions}\cite{Fraser89physica}.  The variability is
usually explained by observing that the procedures used to estimate
invariants converge in the limit of infinite amounts of noise free
data, but that experimenters work with short noisy data sets instead.
While it may be true that different applications lead to different
optimal reconstructions, we suspect\footnote
   {This was suggested to us by Henry Abarbanel\cite{Abarbanel89mdl}
   and is similar to the notion of Casdagli {\em et al.}\cite
   {Casdagli91physica} that reconstruction and prediction are related.}
that practical embedology will best be developed as an aspect of
modeling techniques for optimizing likelihood or a variant such as MDL
or AIC.

For noisy or stochastic dynamics and observations, equation \eqref{eq:ssp}
is not appropriate.  Instead, the $y$s are functions of a Markov process
and are characterized by the conditional densities\footnote
   {Our notation blurs the distinction between probability distributions of
   discrete variables and probability densities of continuous variables.
   For the probability at $x$, we use the notation $P(x)$,
   for the function, we use $P_{x}$, i.e., we use subscripts
   to specify a function and parentheses to denote the value of a
   function at a point.  We resolve ambiguities such as $P(5.23)$ by
   using subscripts or $\argeq$, e.g., $P_x(5.23)$ or $P(x\argeq 5.23)$.
   We attempt to balance opacity and imprecision in our notation.}
$P_{\z(t\argplus 1)|\z(t)}$ and $P_{y(t)|\z(t)}$ with
%
\begin{subequations}
\label{eq:stochastic}
  \begin{align}
    P\left( \z(t\argplus 1)|\z_{-\infty}^t,y_{-\infty}^t \right) &=
    P\left( \z(t\argplus 1)|\z(t)\right) \\
    P\left(y(t)|\z_{-\infty}^t,y_{-\infty}^{\infty}\right) &=
    P\left(y(t)|\z(t)\right).
  \end{align}
\end{subequations}
%
Given these conditional density functions and a natural measure or
stationary density $\mu$ with
$\mu(\z') = \int P_{\z(t\argplus 1)|\z(t)}\left(\z'|\tilde \z \right)
\mu(\tilde \z) d \tilde \z$,
forecasting formally reduces to iterating a recursion
\begin{subequations}
\label{eq:kalman}
  \begin{align}
    P \left( y(T \argplus 1)\right|y_1^T) & = 
    \int P\left(y(T\argplus 1)|\z(T\argplus 1)\right)\, 
    P\left(\z(T\argplus 1)|y_1^T\right)\, d\z(T\argplus 1)\\
    P\left(\z(T\argplus 1)|y_1^T\right) & =  \int
    P\left(\z(T\argplus 1)|\z(T)\right)\,
    P\left(\z(T)|y_1^T\right)\, d\z(T) \\
    P\left(\z(T)|y_1^T\right) & =  \frac{P\left(y(T)|\z(T)\right)\,
      P\left(\z(T)|y_1^{T-1}\right)}{P\left(y(T)|y_1^{T-1}\right)}
  \end{align}
\end{subequations}
starting with $P(\z(1)) = \mu$.  The key idea is that in state space
there is an evolving cloud of locations that are consistent with
observations up to the time $t$, i.e., $P\left(\z(t)|y_1^t\right)$.
If the conditional densities of equation \eqref{eq:stochastic}
correspond to adding Gaussian noise to equation \eqref{eq:ssp},
they can be written as
\begin{subequations}
\label{eq:gauss}
  \begin{align}
    P({\z(t\argplus 1)|\z(t)}) &=
    \left[\frac{\left|\Sigma^{-1}\right|}{\sqrt{2\pi}}\right]^n
    e^{\frac{-\left[\z(t\argplus 1) - F(\z(t))\right]  \Sigma^{-1}
        \left[\z(t\argplus 1) - F(\z(t))\right]}{2}}  \\
    P({y(t)|\z(t)})   &= \frac{1}{\sqrt{2\pi\sigma_y^2}}
    e^\frac{\left[ y(t) - G(\z(t)) \right]^2}{2\sigma^2}
  \end{align}
\end{subequations}
where $\Sigma^{-1}$ is an inverse covariance matrix.
Further, if the functions $F$ and $G$ are linear, the recursion
of equations \eqref{eq:kalman} constitutes Kalman filtering\footnote
   {Sorenson\cite{Sorenson70} observes that the basic ideas go back to Gauss.}.

The simplicity of the equations \eqref{eq:kalman} obscures the following
difficulties:
\begin{itemize}
\item The conditional densities that specify a stochastic process may be
complex requiring descriptions with infinite numbers of parameters.
\item The derived intermediate terms like $P\left(\z(t)|y_1^t\right)$
are not simply values, but functions.
\item And most importantly, the conditional densities that specify a
stochastic process must be estimated on the basis of observations alone.
\end{itemize}
Rather than attempting to estimate the ``true'' stochastic process
on the basis of observations, in view of these difficulties, we
reconsider our pragmatic goals.  For forecasting, we are
interested only in the $y$s, the hidden $\z$s are just computational
intermediates.  Consequently, we consider a {\em model} of the process
to be a sequence of probability density functions
$P_{y_1^t}:\REAL^t\rightarrow\REAL, \{t=1,2,3,\ldots\}$.  In selecting
model classes and fitting their parameters, our goal is
to obtain a set $\left\{P_{y_1^t}:~\forall t \in \INTEGER^+ \right\}$
or equivalently
$\left\{P_{y(t\argplus 1)|y_1^t}:~\forall t \in \INTEGER^+ \right\}$
that performs well in our application (forecasting) rather
than discovering a ``true'' generating mechanism\footnote
    {Our view of forecasting has been influenced by Williams' book on data
    compression\cite{Williams91} and indirectly by the work of
    Rissanen\cite{Rissanen89}.}.
We are still exploring several model types and not having carefully
accounted for free parameters in our comparisons, we simply use maximum
likelihood methods.

\section{Model Classes}

In this section we begin by introducing basic discrete state discrete
output HMMs.  Then we turn to mixed state models, of which the hidden
filter hidden Markov models
HFHMMs\footnote
   {We called these models autoregressive hidden Markov models
   (ARHMMs) until we found that Poritz\cite{Poritz88} had already described
   them and called them HFHMMs.}
described in sections \ref{sec:hfhmm} and \ref{sec:fba} are a special case.

Equation \eqref{eq:stochastic} describes a process in which the observable
is a probabilistic function of an underlying Markov process.  If the state
variables $\z$ and the observations $y$ are drawn from discrete finite
sets, the process is what is called a hidden Markov model (HMM).
Much HMM development work is motivated by applications in natural
language.  The original work was done in the Communications Research
Division of the Institute for Defense Analysis in Princeton and the
methods were reviewed at an open symposium in 1980.  In the proceedings
Ferguson\cite{Ferguson80} described HMMs:
\begin{quote}
   \ldots a Markov chain with state space ${\cal S}$, having $S$ states
   \ldots, a finite output alphabet, ${\cal K}$, which we may take to be
   the integers $1,2,\ldots,K$, and a collection of probability
   distributions,  Explicitly, we need a transition matrix
   $(a_{ij}),~i,j\in {\cal S}$, where
\[ a_{ij} = {\rm Prob\{next~state}=~j{\rm~given~current~state}=i \} \]
   and we need an output probability matrix
   $(b_j(k)),~j\in {\cal S}, k \in {\cal K}$, where
\[ b_j(k) = {\rm Prob\{observation}=~k{\rm~given~current~state}=j \} \]
   For completeness, we need an initial distribution on states, to get
   us started.  Let $(a(i)), i\in {\cal S}$ be this distribution.
\end{quote}
HMMs are useful because the probability distributions can be adjusted
by the Baum-Welch, or forward-backward, algorithm to
maximize the likelihood of a given set of training data.  We describe
the version of the algorithm needed for HFHMMs in section \ref{sec:fba}.

The following points about discrete HMMs merit emphasis:
\begin{enumerate}
\item Although the hidden process is first order Markov, the output
process may not be Markov of any order. \label{noX}
\item Even if the dynamics and observations (the functions $F$ and $G$
in equation \eqref{eq:gauss}) are nonlinear, a discrete HMM can
approximate the continuous case arbitrarily well by using large
numbers of states $S$ and possible output values $K$.
\item Larger numbers of training data are required as $S$ and $K$ are
increased. \label{point:large}
\end{enumerate}
As an illustration of point \ref{noX} consider the process
described by
\[
a = \left[ \begin{array}{cccc} .9&.1&0&0 \\ 0&0&1&0 \\ 0&0&.9&.1 \\ 1&0&0&0
   \end{array} \right] \hspace{2cm}
b=\left[\begin{array}{ccc} 1&0&0 \\ 0&1&0 \\ 1&0&0 \\ 0&0&1 \end{array} \right]
\]
which produces output strings with runs of about seven $1$s
interspersed with occasional $2$s and $3$s.  In the output stream,
the $2$s and $3$s alternate no matter how many $1$s fall in between.
Such behavior can not be captured by a simple Markov process of any
order.

A discrete HMM fit to continuous observations of continuous dynamics
(e.g.\ equation \eqref{eq:gauss}) disregards useful properties of the
data.  The large number of training data in point \ref{point:large}
above can be reduced by preserving a measure of nearness; parameters
for situations that
do not occur in the training data can be fit on the basis of
{\em interpolations} of nearby situations that do.  To build HMM--like
models that interpolate in this sense, we introduce the notion of a
{\em mixed state} $\psi(t) = (s(t),\x(t))$ which consists of a
discrete part $s\in \{1,2,\ldots,n_{\rm states}\}$ and a continuous
part $\x\in\REAL^n$.

%
\begin{figure}[t]
  \centering \includegraphics[width=11cm]{fig1.eps}
\caption[Model Classes]{
%
Chains of dependency in
various model classes.  Solid lines indicate deterministic dependence
and dotted lines indicate stochastic influence.  In each sketch time
advances one step with the subsequent conditions appearing above the
prior conditions.  Given the depicted influences on a node, earlier
values of all variables are irrelevant; thus the {\bf HMM} sketch indicates
$P(y(t)|s_1^t,y_1^{t\argminus 1}) = P(y(t)|s(t)$ and
$P(s(t)|s_1^{t-1},y_1^{t-1}) = P(s(t)|s(t\argminus 1)$.
An {\bf AR} model can be written as
$P(y(t)|\x(t)) =
\frac{1}{\sqrt{2\pi\sigma^2}}\exp-\frac{(y(t)-\a\cdot\x(t))^2}{2\sigma^2}$
with $\x(t) = F(x(t\argminus 1),y(t\argminus 1))$ defined by
$\x_1(t) = y(t\argminus 1)$
and $\x_i(t) = \x_{i\argminus 1}(t\argminus 1):\, 1 < i \leq m$.
In the sketch, the solid lines indicate the function $F$ and
the dotted lines indicate $P_{y(t)|\x(t)}$.  In a piecewise linear
({\bf PWL}) model, the history vector $\x$ determines the partition
element $s$ and thus the linear rule $\hat y(t) = \a_s \cdot \x(t)$,
then $P_{y(t)|\x(t),s(t)}$ is Gaussian with mean $\hat y(t)$ and
variance $\sigma_s^2$.  {\bf HFHMM}s are discussed in sections
\protect\ref{sec:hfhmm} and \protect\ref{sec:forecast} and type 1
mixed state hidden Markov models ({\bf MSHMM1}) are discussed in
section \protect\ref{sec:forecast}.
%
}
\label{fig:sketch}
\end{figure}
%

As a simplifying assumption, we let the
continuous part be a deterministic function of past observations,
i.e., $\x(t) = {\rm Funct.}(y_1^{t-1})$.  The mixed states are meant to
summarize histories; thus we assume that
$P(y(t)|\psi(t),y_1^{t-1}) = P(y(t)|\psi(t))$ and
\begin{equation}
P(y(t)|y_1^{t-1}) = \sum_{s(t)} P(y(t)|\psi(t)) \, P(\psi(t)|y_1^{t-1}).
\label{eq:sum}
\end{equation}
By putting all of the uncertainty about location into the
discrete part of a mixed state, the integral in equation
\eqref{eq:kalman}a has been simplified to the sum in equation \eqref{eq:sum}.
While we doubt natural time series are actually generated by mixed
state processes, we use them as models because they achieve such
operational simplifications and their observable aspects, i.e., 
$\left\{P_{y(t\argplus 1)|y_1^t}:~\forall t \in \INTEGER^+ \right\}$,
provide high likelihood fits to complex behavior using relatively
few free parameters.

In our discussions as we develop models and write programs to implement
them, we have found sketches like those in fig.\ \ref{fig:sketch}
helpful.

\section{Incomplete Data: The EM Algorithm}
\label{sec:em}

In this section we review the EM algorithm\footnote
   {Our development follows the 1970 paper by Baum {\em et al.}\cite{Baum70}.
   In a 1977 paper Dempster Laird and Rubin\cite{Dempster77} called the
   procedure the {\em estimate maximize} algorithm.  We recommend
   Brown's dissertation\cite[page 25]{Brown87} for clarity on the
   subject and Poritz\cite{Poritz88} for a thorough bibliography and
   historical outline.}
which adjusts model parameters $\btheta$ to maximize the likelihood
of observations $\y$.  It operates on models\footnote
   {For our application, $\y$ is a sequence of observations
   $(y(1),y(2),\ldots,y(T))$ and $\s$ is a sequence of discrete hidden
   states $\left(s(1),\ldots,s(T)\right)$.  }
which include unobserved data $\s$, $P_{\y,\s,\btheta}$.
In section \ref{sec:hfhmm} we describe a specific model for time series,
and in section \ref{sec:fba} a version of the EM algorithm tailored for
that model is presented.
The steps in any EM algorithm are:
\begin{enumerate}
\item Guess a starting value of $\btheta$.
\item \label{loop} Choose $\thetahat$ to maximize\footnote
   {For discrete $s$, this notation means
   $\bra f \ket = \sum_\s P_{\s|\y,\btheta}(\s|\y) f(\s)$.}
$Q(\theta,\thetahat) \equiv \bra\log P_{\y,\s,\thetahat}(\y,\s) \ket$
\item Set $\btheta = \thetahat$.
\item If not converged, go to \ref{loop}.
\end{enumerate}
This procedure will work if,
\begin{subequations}
\label{eq:imp}
  \begin{align}
    \label{eq:impa}
    Q(\theta,\thetahat) > Q(\theta,\theta) &\Rightarrow
    P_\thetahat(\y) >P_\theta(\y) \\
    &{\rm and}& \nonumber \\
    \label{eq:impb}
    \max_\thetahat Q(\theta,\thetahat) = Q(\theta,\theta) &\Rightarrow
    \max_\thetahat P_\thetahat(\y) = P_\theta(\y).
  \end{align}
\end{subequations}

The truth of the first implication \eqref{eq:impa} is shown as follows:
\[
P_{\thetahat}(\y) = \frac{P_{\thetahat}(\s,\y)} {P_{\thetahat}(\s|\y)}
\]
\[
\log P_{\thetahat}(\y) = \log P_{\thetahat}(\s,\y) - \log P_{\thetahat}(\s|\y)
\]
Note $\bra \log P_{\thetahat}(\y) \ket = \log P_{\thetahat}(\y)$
because $\s$ does not appear inside the $\bra \ket$.  So
\begin{equation}
\log P_{\thetahat}(\y) = \bra  \log P_{\thetahat}(\s,\y) \ket
- \bra  \log P_{\thetahat}(\s|\y) \ket
\label{eq:star}
\end{equation}
The Gibbs inequality\footnote
        {While many information theory texts attribute this inequality
        to Kullback and Leibler\cite{Kullback51}, it appeared 50 years
        earlier in chapter {\em XI} {\em Theorem II} of Gibbs\cite{Gibbs}.}
for two distributions $P_{\theta_1}$ and $P_{\theta_2}$ says
\[
\sum_x P_{\theta_1}(x) \log \frac{P_{\theta_2}(x)}{P_{\theta_1}(x)} \leq 0
~~{\rm or}~~
\left< \log P_{\theta_2}(x)\right>_{\theta_1} \leq
\left< \log P_{\theta_1}(x)\right>_{\theta_1}
\]
So
\[
\bra  \log P_{\thetahat}(\s|\y) \ket \leq
\bra  \log P_{\btheta}(\s|\y) \ket
\]
Now if $\thetahat$ is chosen so that
\begin{equation}
\bra  \log P_{\thetahat}(\s,\y) \ket >
\bra  \log P_{\btheta}(\s,\y) \ket
\label{eq:sq}
\end{equation}
equation (\ref{eq:star}) yields the implication \eqref{eq:impa}
\[
P_{\thetahat}(\y) > P_{\btheta}(\y)
\]
and the algorithm steps uphill.

The second implication, \eqref{eq:impb}, does not always hold.
Since
\[
\left[ \frac{\partial}{\partial \thetahat}
\bra  \log P_{\thetahat}(\s,\y) \ket\right]_{\thetahat = \theta} =
\left[\frac{1}{P_{\thetahat}(\y)}
\frac{\partial}{\partial \thetahat}
P_{\thetahat}(\y)\right]_{\thetahat = \theta},
\]
and
\[
\left[
\frac{\partial^2}{\partial \thetahat^2}
\bra  \log P_{\thetahat}(\s,\y) \ket \right]_{\thetahat = \theta}
= \left[ \frac{1}{P_{\thetahat}(\y)}
\frac{\partial^2}{\partial \thetahat^2}
P_{\thetahat}(\y) - \bra \left(
\frac{\partial}{\partial \thetahat}
\log P_{\thetahat}(\s,\y) \right)^2 \ket
\right]_{\thetahat = \theta}
\]
the critical points of $\bra  \log P_{\theta}(\s,\y) \ket$
and $P_{\theta}(\y)$ are the same, but maxima of the former may
be saddle points of the latter.  But since their basins of attraction
are low dimensional stable manifolds, it is unlikely that the algorithm
will get stuck at a saddle point of $P_{\theta}(\y)$.

\section{Hidden Filter HMMs (HFHMMs)}
\label{sec:hfhmm}

HFHMMs are a class of time series models to which the EM algorithm can
be applied.  They are diagrammed in fig.\ \ref{fig:sketch} and consist
of a hidden first order Markov process on a set of discrete states
$\left\{ s_1,s_2,\ldots,s_{\rm nstates}\right\}$ and associated with
each discrete state is a linear autoregressive output process.  The
conditional transition probabilities from any state $k$ are given by
$P(s(t \argplus 1)|s(t) \argeq k)$,
and the output distribution at time $t$ given
state $s(t) = k$ and history
$\x = \left( y(t\argminus 1),y(t-2),\ldots,y(t-m)\right)$ is
\[
P_{y(t)|s(t),\x(t)}\left(y|k,\x\right) = \frac{1}{\sigma_{k}\sqrt{2\pi}}
\exp -\frac{\left( y - \ybar_{k} - \a_{k} \cdot \x \right)^2}
{2 \sigma_{k}^2}.
\]
A model $\btheta$ consists of all of the
parameters for all of the states; for each state $k$ the parameters
are: the transition probabilities $P_{s(t\argplus 1)|s(t)}(j|k)$, and
the output distribution parameters $\ybar_k$, $\a_k$, and $\sigma_k$.
We impose the following constraints:
\begin{itemize}
\item Discrete state transitions are Markov and independent of prior
   outputs\footnote
      {This may be appropriate if ``the noise scale is at least as large as the
      discrete states'', but it is a bad approximation for noise free
      deterministic dynamics.}
   \begin{equation}
   P(s(t)|s_1^{t-1},y_1^{t-1}) = P(s(t)|s(t\argminus 1)),
   \label{eq:sind}
   \end{equation}
   and $s(1)$ and $\x(1)$ are independent
   \begin{equation}
   P(s(1),\x(1)) = P(s(1))\, P(\x(1)).
   \label{eq:s1x1ind}
   \end{equation}
   Using a notation in which $s_q(t)$ is the $t^{\rm th}$ element in the
   sequence of states $q$, i.e., $q \equiv (s_q(1),s_q(2),\ldots,s_q(T))$,
   we can write
   \[
   P(q) = P(s_q(1)) \prod_{t=2}^{T} P(s_q(t)|s_q(t\argminus 1)).
   \]
\item Given $\x(t)$ and $s(t\argminus 1)$, earlier values of $s$
    and $y$ are irrelevant,
    \begin{equation}
    P(y(t),s(t)|y_1^{t-1},s_1^{t-1}) = P(y(t),s(t)| s(t\argminus 1), \x(t)),
    \label{eq:jmarkov1}
    \end{equation}
    which with eqn.\ \eqref{eq:sind} implies
    \begin{equation}
    P(y(t),s(t)|s(t\argminus 1),\x(t)) =
        P(y(t)|s(t),\x(t))P(s(t)|s(t\argminus 1)).
    \label{eq:jmarkov2}
    \end{equation}
\end{itemize}

Given $y_1^T$, a sequence of observations for training, the EM
algorithm adjusts the model parameters $\theta$ to maximize the
likelihood which can be evaluated as
\[
P_\theta\left(y_1^T\right) = \sum_q P_\theta\left(y_1^T,q\right),
\]
where the variable $q$ runs over all possible state sequences.
Step \ref{loop} of the algorithm prescribes selecting new parameters
$\thetahat$ to maximize the expected log likelihood
$\bra\log P_{y_1^T,q,\thetahat}(y_1^T,q) \qket$, where the expectation
is with respect to the conditional distribution $P_\btheta(q|y_1^T)$
based on the old parameters $\theta$.  We assume $\x(1)$ is available\footnote
        {We also drop $P_{\x(1)}(\x(1))$ in all calculations, i.e., set
        $P_{\x(1)}(\x(1)) = 1$.},
and use the assumptions to write:
\begin{eqnarray*}
P_\thetahat(y_1^T,q) & = & P_\thetahat(y(1),s_q(1)|\x(1))\,
    \prod_{t=2}^T P_\thetahat(y(t),s_q(t)|s_q(t\argminus 1),\x(t\argminus 1))\\
% & = & P_\thetahat(y(1)|s_q(1),\x(1))\, P_\thetahat(s_q(1))\,
%    \prod_{t=2}^T P_\thetahat(s_q(t)|s_q(t\argminus 1))\,
%    \prod_{t=2}^T P_\thetahat(y(t)|s_q(t),\x(t)) \\
 & = & P_\thetahat(s_q(1))\,
    \prod_{t=2}^T P_\thetahat(s_q(t)|s_q(t\argminus 1))\,
    \prod_{t=1}^T P_\thetahat(y(t)|s_q(t),\x(t))\\
\log P_\thetahat(y_1^T,q) & = & \log P_\thetahat (s_q(1))
+ \sum_{t=1}^{T-1} \log P_\thetahat(s_q(t\argplus 1)|s_q(t))
\\
 & & + \sum_{t=1}^T \log P_\thetahat (y(t)|s_q(t),\x(t))
\\
 & = &  \log P_\thetahat (s_q(1))
+ \sum_{t=1}^{T-1} \log P_\thetahat(s_q(t\argplus 1)|s_q(t))
\\
 & & - \sum_{t=1}^T \left\{
        \log \sigma_{s_q(t)}
        + \frac{1}{2} \log 2\pi
        + \frac{\left(y(t)-\ybar_{s_q(t)}-\a_{s_q(t)}\cdot\x(t)\right)^2}
        {2 \sigma_{s_q(t)}^2} \right\}
\end{eqnarray*}

To proceed with the optimization formally, we need $P_\theta(q|y_1^T)$.
If we use the notation $w(q) \equiv P_\theta(y_1^T,q)$, then
$P_\btheta(q|y_1^T) = w(q)/W$, where $W \equiv \sum_{q'} w(q')$.
The number of terms in
$\sum_{q'} w(q')$ depends exponentially on $T$, precluding a direct
evaluation for $T$s large enough to be interesting, but the sum {\em can}
be evaluated by the {\em forward-backward algorithm} which is {\em
linear} in $T$.  We will describe the algorithm in section \ref{sec:fba},
but first we write out expressions for the required optimization.
\begin{eqnarray*}
\lefteqn{W \bra\log P_\thetahat (y_1^T,q) \qket =
\sum_q w(q) \log P_{s(1),\thetahat}(s_q(1))}\\
& & + \sum_{q,t=1}^{T-1} w(q) \log P_\thetahat (s_q(t\argplus 1)|s_q(t))
- \frac{W}{2}\log 2 \pi
\\
& & - \sum_{q,t=1}^T w(q) \left\{\log \sigma_{s_q(t)}
        + \frac{\left(y(t)-\ybar_{s_q(t)}-\a_{s_q(t)}\cdot\x(t)\right)^2}
        {2 \sigma_{s_q(t)}^2} \right\}
\end{eqnarray*}
and the maximization can be done separately by maximizing the term
\begin{equation}
F(\thetahat) \equiv
\sum_q w(q) \log P_\thetahat(s(1)\argeq s_q(1)) +
\sum_{q,t} w(q) \log P_\thetahat(s_q(t\argplus 1)|s_q(t))
\label{eq:F}
\end{equation}
and minimizing the term
\begin{subequations}
\label{eq:g}
  \begin{align}
    G(\thetahat) & \equiv  \sum_{q,t} w(q) \left\{
      \frac{\left(y(t)-\ybar_{s_q(t)}-\a_{s_q(t)}\cdot\x(t)\right)^2}
      {2 \sigma_{s_q(t)}^2} + \log \sigma_{s_q(t)} \right\}\\
    & \equiv  \sum_{q,t} w(q) g(\thetahat_{s_q(t)},y(t),\x(t)).
  \end{align}
\end{subequations}
In equation (\ref{eq:g}) $\thetahat_s$ refers to the parameters in the
model that are associated with the state $s$, i.e., $\ybar_s$, $\a_s$,
and $\sigma_s$.  We convert the sum over $q$ and $t$ to a sum over $s$ and $t$:
\begin{subequations}
\label{eq:wst}
  \begin{align}
    G(\thetahat) & =  \sum_{s,t} \left\{ g(\thetahat_s,y(t),\x(t))
      \sum_q w(q) \delta_{s,s_q(t)} \right\}\\
    & \equiv  \sum_{s,t} g(\thetahat_s,y(t),\x(t)) w(s,t).
  \end{align}
\end{subequations}
The function $w(s,t)$ introduced in equation (\ref{eq:wst}) is the total
probability, considering all possible paths $q$, that the system is in
state $s$ at time $t$ and that the sequence of outputs $y_1^T$ is produced
by the model, i.e.,
\[
w(s,t) \equiv P_\theta(s(t),y_1^T).
\]
In section \ref{sec:fba} we describe how to calculate $w(s,t)$ using the
forward-backward algorithm.
Given $w(s,t)$, finding new values for $\ybar_s$, $\a_s$, and $\sigma_s$
is fairly standard linear fitting; solve for $\a_s$ and $\ybar_s$ by
using the SVD method\footnote{
        See equation 14.3.16 on page 535 of Press {\em et al.}\cite{Press88}}
to minimize
\[
\chi^2 = \sum_t \left\{
y(t)\sqrt{w(s,t)}  - (\ybar_s + \x(t)\cdot \a_s) \sqrt{w(s,t)} \right\}^2,
\]
and, defining $W(s) = \sum_t w(s,t)$, set
\[
\sigma_s = \sqrt{ \frac{1}{W(s)}
        \sum_t { w(s,t) \left( y(t) - \hat y(t) \right) ^2 } }.
\]
Introducing the notation
\[
w(i,j,t) \equiv P_{s(t\argplus 1),s(t),y_1^T,\theta}(i,j,y_1^T),
\]
and denoting the new discrete transition
probabilities $f_{ij} \equiv P_{s(t\argplus 1)|s(t),\thetahat}(i|j)$, we
observe that optimizing eqn.\ (\ref{eq:F}) requires maximizing
\[
F_{ij} \equiv \sum_{i,t} w(i,j,t) \log(f_{ij}) {\rm ~~ subject ~ to: ~~}
\sum_i f_{ij} = 1.
\]
The Lagrange multiplier method yields
\[
f_{ij} \propto \sum_t w(i,j,t).
\]
Selecting the new $P_{s(1),\thetahat}(s)$ is a similar problem, and the
solution is given in equation (\ref{eq:P0}) of the next section.

\section{The Forward-Backward Algorithm}
\label{sec:fba}

The forward-backward algorithm is an EM algorithm specifically for
time series.  The first steps of the algorithm are two passes
through the time series: One ``forwards'' from $t=1$ to $t=T$ to
calculate $\alpha$s, and the other ``backwards'' from $t=T$ to $t=1$
to calculate $\beta$s.  The factors $w(s,t)$ and $w(i,j,t)$ used in the
previous section, can be evaluated in terms of these $\alpha$s and
$\beta$s which are defined as follows:
\begin{description}
\item[$\alpha(s,t)$] The probability, based on the model, of the
   observations up to time $t$ and that the system is in state $s$ at
   time $t$:
\[
\alpha(s,t) \equiv P(y_1^t,s(t))
\]
\item[$\beta(s,t)$] The probability, based on the model, of the
   observations after time $t$ given that the system is in state $s$ at
   time $t$ and given the previous observations
   $y_1^t$:
\[
\beta(s,t) \equiv P(y_{(t\argplus 1)}^T|s(t),y_1^t)
\]
\end{description}
These definitions and equations \eqref{eq:jmarkov1} and \eqref{eq:jmarkov2}
yield
\[
w(s,t) = \alpha(s,t) \beta(s,t),
\]
\[
w(i,j,t) = \alpha(j,t)\,
P(y(t\argplus 1)|s(t\argplus 1)\argeq i,\x(t\argplus 1))\,
P(s(t\argplus 1)\argeq i|s(t)\argeq j)\, \beta(i,t\argplus 1),
\]
and the recursion formulas
\[
\alpha(s,t) = \sum_j \alpha(s_j,t\argminus 1)\,
P(s(t)\argeq s|s(t\argminus 1)\argeq s_j)\, P(y(t)|s(t)\argeq s,\x(t)),
\]
\[
\beta(s,t) = \sum_j \beta(s_j,t\argplus 1)\,
      P(s(t\argplus 1)\argeq s_j|s(t)\argeq s)\,
      P(y(t\argplus 1)|s(t\argplus 1)\argeq s_j,\x(t\argplus 1)).
\]
Note:\begin{enumerate}
\item For the new model, the initial state probabilities are
\begin{equation}
P_{s(1),\thetahat}(s) \propto P_{s(1),y_1^T,\theta}(s,y_1^T) =
\alpha(s,1)\beta(s,1)
\label{eq:P0}
\end{equation}
subject to normalization.
\item The overall likelihood of the observations
can be evaluated after a forward pass via:
\[
P_\btheta(y_1^T) = \sum_s \alpha(s,T)
\]
\item The forward recursion is initialized by:
\[
\alpha(s,1) = P(s(1)\argeq s)\, P(y(1)|s(1)\argeq s,\x(1))
\]
\item The backward recursion is initialized by:
\[ \beta(s,T) = 1 \]
\end{enumerate}

\subsection{Programming Tricks}
If
\[ \gamma(t) \equiv \sum_s \alpha(s,t) = P(y_1^t), \]
and the process has an
entropy rate $h$, then
\[ \gamma(t) \approx e^{-ht}, \]
and something must be done to prevent under(over)flow for even moderate
values of $t$.  The trick is: at each step in the forward recursion
record only $a(t)$ and $c(t)$, and at each step in the backward
recursion record only $b(t)$, where
\[ c(t) \equiv \frac{\gamma(t)}{\gamma(t\argminus 1)} {\rm ~~ or ~~}
\gamma(t) = \prod_{\tau=1}^t c(\tau) , \]
\[ a(j,t) = \frac{\alpha(j,t)}{\gamma(t)} = P(s(t)\argeq j|y_1^t), \]
and
\[ b(s,t)  = \frac{\beta(s,t) \gamma(t)}{\gamma(T)}. \]
Thus for any $t$
\[ a(s,t)b(s,t) = \frac{ \alpha(s,t) \beta(s,t) }{ \gamma(T) } \propto w(s,t) \]
and
\[ \frac{a(j,t)b(i,t\argplus 1)}{c(t\argplus 1)} P(y(t\argplus 1)|s(t\argplus 1)\argeq i,\x(t\argplus 1))
P(s(t\argplus 1)\argeq i|s(t)\argeq j) \propto w(i,j,t). \]

In each iteration of the forward-backward algorithm, there is a loop over
discrete states in which the parameters for each state $\theta_s$ are
reestimated.  The new estimates are in part based on the weights
$\{w(s,t):t=1,\ldots,T\}$.  Because the Gaussians used in the models have
tails that go on forever, $w(s,t) > 0, \forall t$.  The times $t$ with
weights below a small threshold $w(s,t) < \epsilon$ have little effect on
the new parameters $\theta_s$, and discarding these times speeds
the computations.

\section{Forecasts of Data Set D}
\label{sec:forecast}

We have written a family of programs that construct and optimize HFHMMs.
To seed the forward-backward algorithm we construct a HFHMM based on
a partition of the space of
autoregressive history vectors that we generate by Lloyd iteration\footnote
   {For details on vector quantization, see Gersho and Gray's recent
   text\cite{GershoVQ}}.
Specifying a quantization vector $v_s$ and metric $d_s()$ for each cell
$s$ defines the partition.  An autoregressive history vector $\x$ is in
the cell $s$ which minimizes $d_s(v_s,\x)$.  In each cell, we set the
metric proportional to the inverse covariance of the data in the cell.
To construct the seed HFHMM, we associate a hidden state with each
cell of the partition, initialize the parameters of $P_{y(t)|s(t),\x(t)}$
with a linear fit over training data that fall in the cell,
and use relative frequencies of transitions between cells to estimate
discrete state transition probabilities.
%
\begin{figure}
\centering \includegraphics[bb=42 0 314 349,width=10.7cm]{graph3.ps}
\caption{
Plot $a$ is a forecast of data set D generated by a HFHMM.  The model has
25 discrete states and the autoregressive filters are $8^{\rm th}$ order.
A MSHMM1 with 20 discrete states and $8^{\rm th}$ order autoregressive
filters generated the forecast in plot $b$.
Plots $c$ and $d$ illustrate the long-term behavior of the models.
Each is a prediction of the distribution of $y$ one hundred steps in the
future.  The predictions relax to distributions that are close to the
overall distribution of data set D, which is plotted in $e$.}
\label{fig:far}
\end{figure}
%

We used this procedure to fit the model that generated fig.~\ref{fig:far}$a$,
a forecast of data set D.  The model is the result of 70 passes of the
forward-backward algorithm, each of which required about 45 minutes
%2453.3u 1189.3s 21:55:29 4% 0+20324k 29+4io 1251659pf+0w (step 71 on 100K,orr)
% steps 1 - 70 done on ursula on 100K
on a SPARCstation 2.  We estimated the probability density that constitutes
the forecast using a Monte Carlo method.

As the number of time steps is increased in very long forecasts, probability
leaks away to exponentially larger values of $y$.  Although this effect
is subtle in fig.~\ref{fig:far}$a$, for simple chaotic systems it is
dramatic\cite{Dimitriadis93}.  Considering the Lyapunov exponents
helps explain this defect.  A discrete state sequence $q$, starting at
the end of the observed data $T$ and continuing for a forecast of $\tau$ steps
$q\equiv\left(s_q(T\argplus1),s_q(T\argplus2),\ldots,s_q(T\argplus\tau)\right)$,
specifies a sequence of linear maps, i.e., derivative information.  For typical
long sequences $q$, the magnitudes of the eigenvalues of the products of
these maps will grow at exponential rates given by the Lyapunov exponents.
For chaotic systems, at least one Lyapunov exponent is positive, and
the composed maps are linearly, and hence globally, unstable.  The problem
is that the model is linear in the sense that
\[
y_{T,q,\lambda \x(T)}^{T+\tau} = \lambda y_{T,q,\x(T)}^{T+\tau} 
\]
and
\[
P_\theta (\lambda y_{T,q,\lambda\x(T)}^{T+\tau}|q, \lambda \x(T)) =
P_\theta (y_{T,q}^{T+\tau}|q, \x(T))
\]
where $y_{T,q,\x(T)}^{T+\tau}$ denotes the $y$ sequence that maximizes
$P_\theta (y_{T+1}^{T+\tau}|q,\x(T))$.  Thus, in a HFHMM there are no
nonlinearities to saturate
diverging $y$ values.  The models described below
address this weakness.

%
\begin{figure}[thbp]
\centering \includegraphics[bb=38 0 310 349,width=8.4cm]{page4.ps}
\caption{
These plots illustrate the short-range behavior of the models.  The
first few time steps of the forecasts of figure \protect\ref{fig:far}
appear in plots $a$ and $b$.   For each step, the probability density
is sketched and a horizontal bar indicates the {\em true} continuation data.
The squares in $c$ denote the peaks of the predictions in $b$, and the
line connects the points of the true continuation data.}
\label{fig:near}
\end{figure}
%
\subsection{Output dependent state transitions}
\label{sec:mshmm}
By allowing $\x$ values to influence the transition probabilities between
discrete states, one can introduce the nonlinear saturation that HFHMMs
miss.  In such models, the sequence $(s(t), \x(t))$ still constitutes a
Markov process, but the sequence of discrete states $s(t)$ alone does not.

The assumptions we now make are:

\begin{description}
\item[(a)] $P(s(t\argplus 1) \mid s_1^t, y_1^t) = P(s(t\argplus 1) | s(t), 
y(t), \x(t))$
\item[(b)] $P(y(t) \mid  s_1^t, y_1^{t-1}) =  P(y(t) | s(t), \x(t))$
\item[(c)] $P(y(t), \x(t) \mid s(t), s(t\argplus 1))$ has a multivariate normal
distribution.
\item[(d)] The entire process is stationary.
\end{description}
This type of model is referred to in the figures as MSHMM1.
Simpler model types assume that $P(y(t) \mid s(t), \x(t))$ and
$P(\x(t\argplus 1)|s(t\argplus 1),s(t))$ are normal and that
$P(s(t\argplus 1) \mid s_1^t, y_1^t) = P(s(t\argplus 1) | s(t), \x(t\argplus 1))$.

Under the above assumptions $\psi(t) = (s(t), \x(t))$ is a Markov
process.  It is
possible to compute all quantities of interest by maintaining, for each
transition $s(t) \to s(t\argplus 1)$, the parameters of a multivariate normal
distribution $P(\x(t) \mid s(t), s(t\argplus 1))$, a normally-distributed linear
prediction $P(y(t) | s(t), s(t\argplus 1), \x(t))$, and the constant
$P(s(t\argplus 1) \mid s(t))$.

The EM algorithm is applicable to this class of models as well.  The
transition probabilities in the mixed state space are
\begin{eqnarray*}
\lefteqn{
P(\psi(t\argplus 1)\mid \psi(t)) = P(\x(t\argplus 1) \mid s(t\argplus 1),
s(t), \x(t))
P(s(t\argplus 1) \mid s(t), \x(t))} \\
&=&P(y(t) \mid s(t\argplus 1), s(t), \x(t)) { P(\x(t) \mid s(t), s(t\argplus 1))
P(s(t\argplus 1) \mid
s(t)) \over P(\x(t) \mid s(t)) }
\end{eqnarray*}
The EM algorithm is equivalent to maximizing
$ \left<\sum_t \log P(\psi(t\argplus 1) \mid \psi(t)) \right>_{q|y_1^T}$.
While we do not yet have working code that maximizes this,
we have obtained good results using a naive algorithm which maximizes 
$\left<\sum_t\log P(y(t)\mid s(t\argplus 1),s(t),\right.$
$\left.\x(t))\right>_{q|y_1^T}$,
$ \left<\sum_t \log P(\x(t) \mid s(t\argplus 1), s(t)) \right>_{q|y_1^T}$, and
$ \left<\sum_t \log P(s(t\argplus 1) \mid s(t)) \right>_{q|y_1^T}$, but
ignores the denominator term,
$ -\left<\sum_t \log P(\x(t) \mid s(t)) \right>_{q|y_1^T}$.  
Application of this naive algorithm yields essentially monotonic improvement in
performance, with each model very close to the true optimal performance
for that step in the process.

Predictions made by such a model trained on data set D appear in
figs.~\ref{fig:far}$b$, ~\ref{fig:near}$b$, and ~\ref{fig:near}$c$.
Even for very long forecasts there is no leakage of probability to
ever larger $y$s.

\section{Acknowledgments}
This research was supported in part by NSF grant MIP-9113460
and by a contract from Radix Inc.  Conversations with many
other researchers including Henry Abarbanel, Ronald Hughes, Cory Myers,
and Todd Leen have influenced this work.

\begin{thebibliography}{10}

\bibitem{Abarbanel89mdl}
H.D.I. Abarbanel and J.~Kadtke.
\newblock Information theoretic methods for determining the minimum embedding
  dimension for strange attractors.
\newblock {\em INLS/UCSD preprint}, May 1990.
\newblock Unpublished.

\bibitem{Baum70}
L.E. Baum, T.~Petrie, G.~Soules, and N.~Weiss.
\newblock A maximization technique occuring in the statistica analysis of
  probabilistic functions of Markov chains.
\newblock {\em The Annals of Mathematical Statistics}, 41(1):164--171, 1970.

\bibitem{Brown87}
P.F. Brown.
\newblock {\em The Acoustic-Modeling Problem in Automatic Speech Recognition}.
\newblock PhD thesis, Carnegie Mellon University, Pittsburgh, 1987.

\bibitem{Casdagli91physica}
M.~Casdagli, S.~Eubank, J.D. Farmer, and J.~Gibson.
\newblock State space reconstruction in the presence of noise.
\newblock {\em Physica D}, 51D:52--98, 1991.

\bibitem{Dempster77}
A.P. Dempster, N.M. Laird, and D.B. Rubin.
\newblock Maximum likelihood from incomplete data via the EM algorithm.
\newblock {\em J.R. Stat. Soc. B}, 39:1--78, 1977.

\bibitem{Dimitriadis93}
A.~Dimitriadis and A.M. Fraser.
\newblock Modeling double scroll time series.
\newblock {\em Journal of Circuits, Systems and Computers}, 1993.
\newblock Submitted December, 1992.

\bibitem{Ferguson80}
J.D. Ferguson.
\newblock Hidden Markov analysis: An introduction.
\newblock In {\em Proc. of the Symposium on the Applications of Hidden Markov
  Models to Text and Speech}, pages 8--15, Princeton, 1980. IDA-CRD.

\bibitem{Fraser89physica}
A.~M. Fraser.
\newblock Phase space reconstructions from time series.
\newblock {\em Physica D}, 34D:391--404, 1989.

\bibitem{GershoVQ}
A.~Gersho and R.~M. Gray.
\newblock {\em Vector Quantization and Signal Compression}.
\newblock Kluwer, Norwell MA, 1992.

\bibitem{Gibbs}
J.W. Gibbs.
\newblock {\em Elementary Principles in Statistical Mechanics Developed with
  Especial Reference to the Rational Foundation of Thermodynamics}.
\newblock Yale University Press, 1902.
\newblock Republished by Dover in 1960.

\bibitem{Kullback51}
S.~Kullback and R.A. Leibler.
\newblock On information and sufficiency.
\newblock {\em Ann. Math. Stat.}, 22:79--86, 1951.

\bibitem{Packard80}
N.~H. Packard, J.~P. Crutchfield, J.~D. Farmer, and R.S. Shaw.
\newblock Geometry from a time series.
\newblock {\em Phys. Rev. Lett.}, 45(9):712--716, Sept. 1 1980.

\bibitem{Poritz88}
A.B. Poritz.
\newblock Hidden Markov models: A guided tour.
\newblock In {\em Proc. IEEE Intl. Conf. on Acoust. Speech and Signal Proc.},
  1988.

\bibitem{Press88}
W.~H. Press, B.~P. Flannery, S.~A. Teukolsky, and W.~T. Veterling.
\newblock {\em Numerical Recipes in C}.
\newblock Cambridge University Press, Cambridge, 1988.

\bibitem{Rissanen89}
J.~Rissanen.
\newblock {\em Stochastic Complexity and Statistical Inquiry}.
\newblock World Scientific, 1989.

\bibitem{Sauer91}
T.~Sauer, J.A. Yorke, and M.~Casdagli.
\newblock Embedology.
\newblock {\em J. Stat. Phys.}, 65:579--616, 1991.

\bibitem{Sorenson70}
H.W. Sorenson.
\newblock Least-squares estimation: from Gauss to Kalman.
\newblock {\em IEEE Spectrum}, pages 63--68, July 1970.

\bibitem{Takens81}
F.~Takens.
\newblock Detecting strange attractors in turbulence.
\newblock In D.~A. Rand and L.S. Young, editors, {\em Dynamical Systems and
  Turbulence, Warwick, 1980, Lecture notes in mathematics vol. 898}, pages
  366--381, Berlin, 1981. Springer.

\bibitem{Williams91}
R.N. Williams.
\newblock {\em Adaptive Data Compression}.
\newblock Kluwer, Norwell MA, 1991.

\end{thebibliography}
\end{document}
