3 Parameter estimation techniques for HMMs
In this section we recall basic concepts underlying parameter estimation for HMMs via direct numerical optimization of the likelihood. In terms of notation, we stay as close as possible to Zucchini, MacDonald, and Langrock (2016), where a more detailed presentation is available.
3.1 Basic notation and model setup
The conditional distribution densities can be called in R as well as in C++ (provided that the TMB template is respected) via
dpois(x = 1, lambda = 5) # Poisson
## [1] 0.03368973
dnorm(x = 1, mean = 0, sd = 1) # Normal
## [1] 0.2419707
Other distributions are available.
We abbreviate the transition probability matrix as TPM and denote it by \({\boldsymbol\Gamma}\).
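For illustration, here is a minimal sketch (with arbitrarily chosen probabilities) of a valid 2-state TPM in R; each row must sum to one.
# A 2-state TPM with arbitrary probabilities; each row sums to 1
gamma <- matrix(c(0.95, 0.05,
                  0.15, 0.85), byrow = TRUE, nrow = 2)
rowSums(gamma)
## [1] 1 1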
3.2 The likelihood function of an HMM
The Poisson HMM negative log-likelihood function in C++ can be written the following way using TMB's template.
#include <TMB.hpp>
#include "../functions/utils.cpp"

// Likelihood for a poisson hidden markov model.
template<class Type>
Type objective_function<Type>::operator() ()
{
  // Data
  DATA_VECTOR(x);          // timeseries vector
  DATA_INTEGER(m);         // Number of states m

  // Parameters
  PARAMETER_VECTOR(tlambda);     // conditional log_lambda's
  PARAMETER_VECTOR(tgamma);      // m(m-1) working parameters of TPM
  // Uncomment only when using a non-stationary distribution
  //PARAMETER_VECTOR(tdelta);    // transformed stationary distribution,

  // Transform working parameters to natural parameters:
  vector<Type> lambda = tlambda.exp();
  matrix<Type> gamma = gamma_w2n(m, tgamma);

  // Construct stationary distribution
  vector<Type> delta = stat_dist(m, gamma);
  // If using a non-stationary distribution, use this instead
  //vector<Type> delta = delta_w2n(m, tdelta);

  // Get number of timesteps (n)
  int n = x.size();

  // Evaluate conditional distribution: Put conditional
  // probabilities of observed x in n times m matrix
  // (one column for each state, one row for each datapoint):
  matrix<Type> emission_probs(n, m);
  matrix<Type> row1vec(1, m);
  row1vec.setOnes();
  for (int i = 0; i < n; i++) {
    if (x[i] != x[i]) { // f != f returns true if and only if f is NaN.
      // Replace missing values (NA in R, NaN in C++) with 1
      emission_probs.row(i) = row1vec;
    }
    else {
      emission_probs.row(i) = dpois(x[i], lambda, false);
    }
  }

  // Corresponds to (Zucchini et al., 2016, p 333)
  matrix<Type> foo, P;
  Type mllk, sumfoo, lscale;

  foo = (delta * vector<Type>(emission_probs.row(0))).matrix();
  sumfoo = foo.sum();
  lscale = log(sumfoo);
  foo.transposeInPlace();
  foo /= sumfoo;
  for (int i = 2; i <= n; i++) {
    P = emission_probs.row(i - 1);
    foo = ((foo * gamma).array() * P.array()).matrix();
    sumfoo = foo.sum();
    lscale += log(sumfoo);
    foo /= sumfoo;
  }
  mllk = -lscale;

  // Use adreport on variables for which we want standard errors
  ADREPORT(lambda);
  ADREPORT(gamma);
  ADREPORT(delta);

  // Variables we need for local decoding and in a convenient format
  REPORT(lambda);
  REPORT(gamma);
  REPORT(delta);
  REPORT(n);
  REPORT(emission_probs);
  REPORT(mllk);

  return mllk;
}
The case where \(m = 1\) doesn't involve a hidden state, and thus corresponds to a Poisson regression rather than a Poisson HMM. Nonetheless, TMB also accelerates its estimation and may be useful to the reader.
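To illustrate this special case with a minimal sketch (simulated data, not part of the model code above), \(m = 1\) can be handled by an intercept-only Poisson regression, whose maximum likelihood estimate of \(\lambda\) is simply the sample mean.
# Sketch: m = 1 reduces to an intercept-only Poisson regression
set.seed(1)
x <- rpois(100, lambda = 5)
fit <- glm(x ~ 1, family = poisson)
exp(coef(fit)) # MLE of lambda; equals mean(x)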
3.3 Forward algorithm and backward algorithm
3.3.1 Setup
3.3.1.1 Prepare the model
library(TMB)
source("functions/utils.R")
load("data/fetal-lamb.RData")
TMB::compile("code/poi_hmm.cpp")
dyn.load(dynlib("code/poi_hmm"))
lamb_data <- lamb
m <- 2
TMB_data <- list(x = lamb_data, m = m)
lambda <- c(1, 3)
gamma <- matrix(c(0.8, 0.2,
                  0.2, 0.8), byrow = TRUE, nrow = m)
parameters <- pois.HMM.pn2pw(m, lambda, gamma)
obj_tmb <- MakeADFun(TMB_data, parameters,
                     DLL = "poi_hmm", silent = TRUE)
mod_tmb <- nlminb(start = obj_tmb$par, objective = obj_tmb$fn,
                  gradient = obj_tmb$gr, hessian = obj_tmb$he)
3.3.1.2 Define variables
Given an optimized MakeADFun object (here obj_tmb), we need to set up some variables to compute the probabilities detailed below.
# Retrieve the objects at ML value
adrep <- obj_tmb$report(obj_tmb$env$last.par.best)
delta <- adrep$delta
gamma <- adrep$gamma
emission_probs <- adrep$emission_probs
n <- adrep$n
m <- length(delta)
mllk <- adrep$mllk
Note that \({\boldsymbol\lambda}\) is not needed as we already have access to the emission probabilities.
3.3.1.3 Emission/output probabilities
Emission probabilities (also called output probabilities) are the conditional probabilities of the data given a state.
Using the notation of the article's Section 3.1, emission probabilities are the conditional distributions \(p_i(x) = \text{P}(X_t = x \vert C_t = i) = \frac{e^{-\lambda_i} \lambda_i^x}{x!}\). We store all of these in an \(n \times m\) matrix with one row per observation and one column per state.
More formally, we call \(\begin{pmatrix} p_1(x_1) & p_2(x_1) & \ldots & p_m(x_1)\\ p_1(x_2) & p_2(x_2) & \ldots & p_m(x_2)\\ \vdots & \vdots & \ddots & \vdots\\ p_1(x_n) & p_2(x_n) & \ldots & p_m(x_n)\\ \end{pmatrix}\) the emission probability matrix.
We compute these probabilities in C++ rather than in R because it is faster and no more complicated to write.
// Evaluate conditional distribution: Put conditional
// probabilities of observed x in n times m matrix
// (one column for each state, one row for each datapoint):
matrix<Type> emission_probs(n, m);
matrix<Type> row1vec(1, m);
row1vec.setOnes();
for (int i = 0; i < n; i++) {
  if (x[i] != x[i]) { // f != f returns true if and only if f is NaN.
    // Replace missing values (NA in R, NaN in C++) with 1
    emission_probs.row(i) = row1vec;
  }
  else {
    emission_probs.row(i) = dpois(x[i], lambda, false);
  }
}
Nevertheless, we also need to compute them in R for forecasting.
# Calculate emission probabilities
get.emission.probs <- function(data, lambda) {
  n <- length(data)
  m <- length(lambda)
  emission_probs <- matrix(0, nrow = n, ncol = m)
  for (i in 1:n) {
    if (is.na(data[i])) {
      emission_probs[i, ] <- rep(1, m)
    } else {
      emission_probs[i, ] <- dpois(data[i], lambda)
    }
  }
  return(emission_probs)
}
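For instance, assuming the setup above, the emission probabilities of the lamb data at the estimated Poisson means can be obtained as follows (the name emission_probs_r is ours).
# Emission probabilities of the data at the estimated Poisson means
emission_probs_r <- get.emission.probs(lamb_data, adrep$lambda)
head(emission_probs_r)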
3.3.2 Log-forward probabilities
We show here a way to compute the log of the forward probabilities, using a scaling scheme defined by Zucchini, MacDonald, and Langrock (2016, pp. 66 and 334).
# Compute log-forward probabilities (scaling used)
lalpha <- matrix(NA, m, n)
foo <- delta * emission_probs[1, ]
sumfoo <- sum(foo)
lscale <- log(sumfoo)
foo <- foo / sumfoo
lalpha[, 1] <- log(foo) + lscale
for (i in 2:n) {
  foo <- foo %*% gamma * emission_probs[i, ]
  sumfoo <- sum(foo)
  lscale <- lscale + log(sumfoo)
  foo <- foo / sumfoo
  lalpha[, i] <- log(foo) + lscale
}
# lalpha contains n=240 columns, so we only display 5 for readability
lalpha[, 1:5]
##            [,1]       [,2]       [,3]      [,4]      [,5]
## [1,] -0.2920639 -0.5591179 -0.8265948 -1.094088 -1.361583
## [2,] -6.4651986 -7.7716769 -8.1146145 -8.385232 -8.652850
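As a quick sanity check (a sketch of ours, not part of the original scripts), summing the forward probabilities at the last time point must recover the likelihood. In log-space this can be done with the log-sum-exp trick and compared to the negative log-likelihood reported by TMB; the helper variable cmax is ours.
# Log-likelihood recovered from the last column of lalpha
cmax <- max(lalpha[, n])
cmax + log(sum(exp(lalpha[, n] - cmax))) # should equal -mllk
-mllk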
3.3.3 Log-backward probabilities
We show here a way to compute the log of the backward probabilities with a scaling scheme defined by Zucchini, MacDonald, and Langrock (2016, pp. 67 and 334).
# Compute log-backward probabilities (scaling used)
lbeta <- matrix(NA, m, n)
lbeta[, n] <- rep(0, m)
foo <- rep(1 / m, m)
lscale <- log(m)
for (i in (n - 1):1) {
  foo <- gamma %*% (emission_probs[i + 1, ] * foo)
  lbeta[, i] <- log(foo) + lscale
  sumfoo <- sum(foo)
  foo <- foo / sumfoo
  lscale <- lscale + log(sumfoo)
}
# lbeta contains n=240 columns, so we only display 4 for readability
lbeta[, 1:4]
##           [,1]      [,2]      [,3]      [,4]
## [1,] -177.2275 -176.9600 -176.6925 -176.4250
## [2,] -178.3456 -178.0781 -177.8099 -177.5253
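Similarly (again a sketch of ours), combining log-forward and log-backward probabilities at any time point must return the same log-likelihood, since \(\sum_i \alpha_t(i) \beta_t(i)\) equals the likelihood for every \(t\). The index k below is arbitrary.
# The same log-likelihood is recovered at an arbitrary time point
k <- 100
cmax <- max(lalpha[, k] + lbeta[, k])
cmax + log(sum(exp(lalpha[, k] + lbeta[, k] - cmax))) # should equal -mllk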
We refer to the section State Inference for an application of this code to state inference and forecasting.
3.4 Reparameterization of the likelihood function
3.4.1 Utility functions in TMB
Defining the negative log-likelihood function requires transforming the working parameters into their natural form.
We define the function gamma_w2n to perform this transformation. The function stat_dist handles computing the stationary distribution, while delta_w2n can be used if a non-stationary distribution is provided. These functions are defined in functions/utils.cpp.
// Function transforming working parameters in initial distribution
// to natural parameters
template<class Type>
vector<Type> delta_w2n(int m, vector<Type> tdelta) {

  vector<Type> delta(m);
  vector<Type> foo(m);

  if (m == 1)
    return Type(1);

  // set first element to one.
  // Fill in the last m - 1 elements with working parameters
  // and take exponential
  foo << Type(1), tdelta.exp();

  // normalize
  delta = foo / foo.sum();

  return delta;
}

// Function transforming the working parameters in TPM to
// natural parameters (w2n)
template<class Type>
matrix<Type> gamma_w2n(int m, vector<Type> tgamma) {

  // Construct m x m identity matrix
  matrix<Type> gamma(m, m);
  gamma.setIdentity();

  if (m == 1)
    return gamma;

  // Fill offdiagonal elements with working parameters column-wise:
  int idx = 0;
  for (int i = 0; i < m; i++){
    for (int j = 0; j < m; j++){
      if (j != i){
        // Fill gamma according to mapping and take exponential
        gamma(j, i) = tgamma.exp()(idx);
        idx++;
      }
    }
  }
  // Normalize each row:
  vector<Type> cs = gamma.rowwise().sum();
  for (int i = 0; i < m; i++) gamma.row(i) /= cs[i];

  return gamma;
}

// Function computing the stationary distribution of a Markov chain
template<class Type>
vector<Type> stat_dist(int m, matrix<Type> gamma) {

  // Construct stationary distribution
  matrix<Type> I(m, m);
  matrix<Type> U(m, m);
  matrix<Type> row1vec(1, m);
  U = U.setOnes();
  I = I.setIdentity();
  row1vec.setOnes();
  matrix<Type> A = I - gamma + U;
  matrix<Type> Ainv = A.inverse();
  matrix<Type> deltamat = row1vec * Ainv;
  vector<Type> delta = deltamat.row(0);

  return delta;
}

// // Function computing log-forward probabilities (scaling used)
// template<class Type>
// vector<Type> lalpha(int n,
// // matrix<Type> lalpha(int n,
//                     int m,
//                     vector<Type> delta,
//                     matrix<Type> gamma,
//                     matrix<Type> emission_probs) {
//
//   matrix<Type> lalpha(m, n);
//   vector<Type> foo = delta * vector<Type>(emission_probs.row(0));
//   Type sumfoo = foo.sum();
//   Type lscale = sumfoo.log();
//   foo /= sumfoo;
//   // lalpha.row(0) << foo.log() + lscale;
//
//   return foo;
// }
Transforming the Poisson means into their natural form can be done simply with the exp function and doesn't require a dedicated function.
3.4.2 Utility functions in R
While TMB requires functions transforming working parameters to their natural form, pre-processing in R requires the inverse transformation. Functions to achieve this are available in Zucchini, MacDonald, and Langrock (2016, p. 51) and are displayed here with explanatory comments for convenience.
# Transform Poisson natural parameters to working parameters
pois.HMM.pn2pw <- function(m, lambda, gamma, delta = NULL,
                           stationary = TRUE) {
  tlambda <- log(lambda)
  foo <- log(gamma / diag(gamma))
  tgamma <- as.vector(foo[!diag(m)])
  if (stationary) {
    # If tdelta is set to NULL and returned in the list,
    # it will cause issues when optimizing with TMB
    return(list(tlambda = tlambda, tgamma = tgamma))
  } else {
    tdelta <- log(delta[- 1] / delta[1])
    # TMB requires a list
    return(list(tlambda = tlambda, tgamma = tgamma, tdelta = tdelta))
  }
}
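As a usage sketch with the initial values from Section 3.3.1.1, the returned working parameters are the log-means and the logs of the ratios of off-diagonal to diagonal TPM entries.
# Working parameters for the 2-state initial values used above
pois.HMM.pn2pw(m = 2, lambda = c(1, 3),
               gamma = matrix(c(0.8, 0.2,
                                0.2, 0.8), byrow = TRUE, nrow = 2))
# Returns tlambda = log(c(1, 3)) and tgamma = rep(log(0.2 / 0.8), 2)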
This can be broken down into sub-functions if necessary. For the \({\boldsymbol\Gamma}\) part, we introduce gamma.n2w below.
# Function to transform natural parameters to working ones
gamma.n2w <- function(m, gamma) {
  foo <- log(gamma / diag(gamma))
  tgamma <- as.vector(foo[!diag(m)])
  return(tgamma)
}
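To see that this matches the C++ transformation gamma_w2n, here is a small round trip of our own: exponentiating the working parameters into the off-diagonal entries of an identity matrix (column-wise, like gamma_w2n) and renormalizing the rows recovers the original TPM.
# Round trip: natural -> working -> natural
m <- 2
gamma <- matrix(c(0.8, 0.2,
                  0.2, 0.8), byrow = TRUE, nrow = m)
tgamma <- gamma.n2w(m, gamma)
foo <- diag(m)
foo[!diag(m)] <- exp(tgamma) # fill off-diagonal column-wise
foo / rowSums(foo) # recovers gamma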
In the case where a non-stationary distribution is specified, transforming \({\boldsymbol\delta}\) is necessary. For this we use the delta.n2w function.
# Function to transform natural parameters to working ones
delta.n2w <- function(m, delta) {
  tdelta <- log(delta[- 1] / delta[1])
  return(tdelta)
}
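The inverse transformation mirrors delta_w2n: prepend a one, exponentiate, and normalize. A small sketch of ours:
# Round trip: natural -> working -> natural
delta <- c(0.25, 0.5, 0.25)
tdelta <- delta.n2w(3, delta)
c(1, exp(tdelta)) / sum(c(1, exp(tdelta))) # recovers delta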
When assuming a stationary distribution, computing it can be done via the following stat.dist function.
# Compute the stationary distribution of a Markov chain
# with transition probability gamma
stat.dist <- function(gamma) {
  # The code from Zucchini can crash when dealing with computationally small numbers.
  # This is likely an approximation error.
  # For some reason, the result may be a complex number with imaginary part equal to 0.
  # We can just ignore the imaginary part because the eigenvectors of a real
  # eigenvalue must be real.
  # In the cases where it happened, solve(t(diag(m) - gamma + 1), rep(1, m))
  # produced the same result without the imaginary part.
  first_eigen_row <- Re(eigen(t(gamma))$vectors[, 1])
  return(first_eigen_row / sum(first_eigen_row))
}
Zucchini, MacDonald, and Langrock (2016, p. 51) show that the stationary distribution can be obtained by solving the equation below for \({\boldsymbol\delta}\), where \({\boldsymbol I}_m\) is the \(m \times m\) identity matrix, \({\boldsymbol U}\) is an \(m \times m\) matrix of ones, and \({\boldsymbol 1}\) is a row vector of ones. \[ {\boldsymbol\delta}({\boldsymbol I}_m - {\boldsymbol\Gamma} + {\boldsymbol U}) = {\boldsymbol 1} \]
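This system can also be solved directly in R, which is the alternative mentioned in the comments of stat.dist above. A short sketch with the 2-state TPM used earlier (whose symmetric rows give a uniform stationary distribution):
# Solve delta %*% (I - Gamma + U) = 1 directly
m <- 2
gamma <- matrix(c(0.8, 0.2,
                  0.2, 0.8), byrow = TRUE, nrow = m)
delta <- solve(t(diag(m) - gamma + 1), rep(1, m))
delta # c(0.5, 0.5) for this TPM
delta %*% gamma # equals delta, as a stationary distribution must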
3.4.2.1 Label switching
Because no ordering is imposed on the states during estimation, repeatedly estimated models can suffer from the label switching problem: states are not ordered the same way in each model, so estimates may be grouped incorrectly when aggregated. The issue is relevant when comparing results of different optimizers, initial parameters, or classes of models.
To address this, we reorder the states by ascending Poisson means after estimation.
Sorting the means is immediate; reordering the TPM is less straightforward.
To do so, we take the permutation of the states given by the sorted Poisson means, and move each row index and column index to its new value.
The function pois.HMM.label.order solves the issue and is presented below.
# Relabel states by increasing Poisson means
pois.HMM.label.order <- function(m,
                                 lambda,
                                 gamma,
                                 delta = NULL,
                                 lambda_std_error = NULL,
                                 gamma_std_error = NULL,
                                 delta_std_error = NULL,
                                 smoothing_probs = NULL,
                                 smoothing_probs_std_error = NULL,
                                 ldecode = NULL,
                                 indices = FALSE) {
  if (anyNA(c(m, lambda, gamma))) {
    return(NA)
  }
  # Remove vector names (optional, but looks better without redundancy)
  names(lambda) <- NULL
  names(lambda_std_error) <- NULL
  names(delta) <- NULL
  names(delta_std_error) <- NULL
  # gamma_vector_indices is used to calculate the indices of the reordered
  # TPM gamma as a vector, for reordering the rows of the complete CI
  # data.frame used for the article.
  gamma_vector_indices <- 1:(m ^ 2)
  gamma_vector_matrix <- matrix(gamma_vector_indices, nrow = m, ncol = m)
  ordered_gamma_vector_matrix <- matrix(0, nrow = m, ncol = m)

  # Get the indices of the sorted states
  # according to ascending lambda
  # ordered_lambda contains the permutations needed
  ordered_lambda_indices <- order(lambda)
  ordered_lambda <- lambda[ordered_lambda_indices]
  names(ordered_lambda) <- NULL
  # Reorder the TPM according to the switched states
  # in the sorted lambda
  ordered_gamma <- matrix(0, nrow = m, ncol = m)
  for (col in 1:m) {
    new_col <- which(ordered_lambda_indices == col)
    for (row in 1:m) {
      new_row <- which(ordered_lambda_indices == row)
      ordered_gamma[row, col] <- gamma[new_row, new_col]
      # Reorder the vector TPM
      ordered_gamma_vector_matrix[row, col] <- gamma_vector_matrix[new_row, new_col]
    }
  }
  # Same for the TPM's standard errors
  ordered_gamma_std_error <- NULL
  if (!is.null(gamma_std_error)) {
    ordered_gamma_std_error <- matrix(0, nrow = m, ncol = m)
    for (col in 1:m) {
      new_col <- which(ordered_lambda_indices == col)
      for (row in 1:m) {
        new_row <- which(ordered_lambda_indices == row)
        ordered_gamma_std_error[row, col] <- gamma_std_error[new_row, new_col]
      }
    }
  }
  # Reorder the stationary distribution if it is provided
  # Generate it otherwise
  ordered_delta <- if (!is.null(delta)) delta[ordered_lambda_indices] else stat.dist(ordered_gamma)
  # Reorder the smoothing probabilities (1 row per state, 1 column per data)
  ordered_smoothing_probs <- if (!is.null(smoothing_probs)) smoothing_probs[ordered_lambda_indices, ] else NULL
  # Reorder the decoded states
  if (!is.null(ldecode)) {
    ldecode <- sapply(ldecode, function(e) {
      # ldecode = -1 for missing data
      # Keep -1 in that case
      if (is.na(e)) {
        return(NA)
      } else {
        return(which(ordered_lambda_indices == e))
      }
    })
  } else {
    ldecode <- NULL
  }
  # Reorder the standard errors
  ordered_lambda_std_error <- lambda_std_error[ordered_lambda_indices]
  ordered_delta_std_error <- delta_std_error[ordered_lambda_indices]
  # 1 row per state, 1 column per data
  ordered_smoothing_probs_std_error <- if (!is.null(smoothing_probs_std_error)) smoothing_probs_std_error[ordered_lambda_indices, ] else NULL

  # The vector is assumed filled column-wise instead of row-wise,
  # because column-wise is the default way R handles matrix to vector conversion.
  # Change to row-wise if needed by replacing ordered_gamma_vector_matrix with
  # t(ordered_gamma_vector_matrix), or add byrow = TRUE to
  # "ordered_gamma_vector_matrix <- matrix(0, nrow = m, ncol = m)"
  # We don't use it in case there is a bug, but it makes logical sense
  # that it should work
  ordered_gamma_vector_matrix <- as.numeric(ordered_gamma_vector_matrix)

  if (indices == TRUE) {
    result <- list(m = m,
                   lambda = ordered_lambda,
                   gamma = ordered_gamma,
                   delta = ordered_delta,
                   lambda_std_error = ordered_lambda_std_error,
                   gamma_std_error = ordered_gamma_std_error,
                   delta_std_error = ordered_delta_std_error,
                   smoothing_probs = ordered_smoothing_probs,
                   smoothing_probs_std_error = ordered_smoothing_probs_std_error,
                   ldecode = ldecode,
                   ordered_lambda_indices = ordered_lambda_indices,
                   ordered_gamma_vector_indices = ordered_gamma_vector_matrix,
                   # delta and lambda are the same size,
                   # so they are ordered the same way
                   ordered_delta_indices = ordered_lambda_indices)
  } else {
    result <- list(m = m,
                   lambda = ordered_lambda,
                   gamma = ordered_gamma,
                   delta = ordered_delta,
                   lambda_std_error = ordered_lambda_std_error,
                   gamma_std_error = ordered_gamma_std_error,
                   delta_std_error = ordered_delta_std_error,
                   smoothing_probs = ordered_smoothing_probs,
                   smoothing_probs_std_error = ordered_smoothing_probs_std_error,
                   ldecode = ldecode)
  }
  # Remove the NULL elements
  result[sapply(result, is.null)] <- NULL
  return(result)
}
We will go through an example to better understand the process. For readability, the TPM is filled with entries encoding row and column indices (e.g. 12 for row 1, column 2) instead of probabilities.
lambda <- c(3, 1, 2)
gamma <- matrix(c(11, 12, 13,
                  21, 22, 23,
                  31, 32, 33), byrow = TRUE, ncol = 3)
pois.HMM.label.order(m = 3, lambda, gamma)
## $m
## [1] 3
##
## $lambda
## [1] 1 2 3
##
## $gamma
##      [,1] [,2] [,3]
## [1,]   33   31   32
## [2,]   13   11   12
## [3,]   23   21   22
##
## $delta
## [1] 0.3482817 0.3183850 0.3333333
State 1 has been relabeled state 3, state 3 has become state 2, and state 2 has become state 1.
Another way to address this issue is by parameterizing in terms of non-negative increments \(\lambda_j - \lambda_{j-1}\) with \(\lambda_0 \equiv 0\), as explained by Zucchini, MacDonald, and Langrock (2016, sec. 7.1.1, p. 112). However, Bulla and Berzel (2008, sec. 3.2, p. 7) show this can lead to optimization issues: "over all series, the simplest parameterization, i.e., the use of log-transformed state-dependent parameters, leads to the best results as regards the number of failures and the convergence to the global maximum".
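For completeness, a minimal sketch of that alternative parameterization (our illustration, with the variable name tlambda_inc ours): the working parameters are the logs of the increments, and the back-transformed means are sorted by construction.
# Increment parameterization of the Poisson means
lambda <- c(1, 3, 7)
tlambda_inc <- log(c(lambda[1], diff(lambda))) # working parameters
cumsum(exp(tlambda_inc)) # recovers c(1, 3, 7), ordered by construction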
These utility functions or subroutines are not complicated, but as you can see, they would clutter the main code. Therefore, we put them in functions we can call from our main program.