3 Parameter estimation techniques for HMMs

In this section we recall basic concepts underlying parameter estimation for HMMs via direct numerical optimization of the likelihood. In terms of notation, we stay as close as possible to Zucchini, MacDonald, and Langrock (2016), where a more detailed presentation is available.

3.1 Basic notation and model setup

The densities of the conditional distributions can be called in R as well as in C++ (provided that the TMB template is respected) via

dpois(x = 1, lambda = 5) # Poisson
## [1] 0.03368973
dnorm(x = 1, mean = 0, sd = 1) # Normal
## [1] 0.2419707

Other standard distributions, such as the binomial, negative binomial, and gamma, are available in both languages.
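
In a negative log-likelihood, one typically accumulates log-densities rather than densities; both R and TMB expose this through the last argument of these functions.

dpois(x = 1, lambda = 5, log = TRUE)
## [1] -3.390562
dnorm(x = 1, mean = 0, sd = 1, log = TRUE)
## [1] -1.418939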

We abbreviate the transition probability matrix as TPM and denote it by \({\boldsymbol\Gamma}\).

3.2 The likelihood function of an HMM

The negative log-likelihood function of a Poisson HMM can be written in C++ as follows, using TMB’s template.

#include <TMB.hpp>
#include "../functions/utils.cpp"

// Negative log-likelihood of a Poisson hidden Markov model.
template<class Type>
Type objective_function<Type>::operator() ()
{
  
  // Data
  DATA_VECTOR(x);          // timeseries vector
  DATA_INTEGER(m);         // Number of states m
  
  // Parameters
  PARAMETER_VECTOR(tlambda);     // conditional log_lambdas's
  PARAMETER_VECTOR(tgamma);      // m(m-1) working parameters of TPM
  
  // Uncomment only when using a non-stationary distribution
  //PARAMETER_VECTOR(tdelta);    // transformed stationary distribution,
  
  // Transform working parameters to natural parameters:
  vector<Type> lambda = tlambda.exp();
  matrix<Type> gamma = gamma_w2n(m, tgamma);
  
  // Construct stationary distribution
  vector<Type> delta = stat_dist(m, gamma);
  // If using a non-stationary distribution, use this instead
  //vector<Type> delta = delta_w2n(m, tdelta);
  
  // Get number of timesteps (n)
  int n = x.size();
  
  // Evaluate conditional distribution: Put conditional
  // probabilities of observed x in n times m matrix
  // (one column for each state, one row for each datapoint):
  matrix<Type> emission_probs(n, m);
  matrix<Type> row1vec(1, m);
  row1vec.setOnes();
  for (int i = 0; i < n; i++) {
    if (x[i] != x[i]) { // f != f returns true if and only if f is NaN. 
      // Replace missing values (NA in R, NaN in C++) with 1
      emission_probs.row(i) = row1vec;
    }
    else {
      emission_probs.row(i) = dpois(x[i], lambda, false);
    }
  }
  
  // Corresponds to (Zucchini et al., 2016, p 333)
  matrix<Type> foo, P;
  Type mllk, sumfoo, lscale;
  
  foo = (delta * vector<Type>(emission_probs.row(0))).matrix();
  sumfoo = foo.sum();
  lscale = log(sumfoo);
  foo.transposeInPlace();
  foo /= sumfoo;
  for (int i = 2; i <= n; i++) {
    P = emission_probs.row(i - 1);
    foo = ((foo * gamma).array() * P.array()).matrix();
    sumfoo = foo.sum();
    lscale += log(sumfoo);
    foo /= sumfoo;
  }
  mllk = -lscale;
  
  // Use adreport on variables for which we want standard errors
  ADREPORT(lambda);
  ADREPORT(gamma);
  ADREPORT(delta);
  
  // Variables we need for local decoding and in a convenient format
  REPORT(lambda);
  REPORT(gamma);
  REPORT(delta);
  REPORT(n);
  REPORT(emission_probs);
  REPORT(mllk);
  
  return mllk;
}

The case \(m = 1\) does not involve a hidden state, and is thus a Poisson regression rather than a Poisson HMM. Nonetheless, TMB accelerates its estimation as well, which may be useful to the reader.
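
As an aside, for \(m = 1\) the maximum likelihood estimate of \(\lambda\) is available in closed form as the sample mean, so no numerical optimization is needed there. Assuming the lamb data from the next subsection are loaded, a one-line check:

# ML estimate of lambda for an i.i.d. Poisson model
mean(lamb, na.rm = TRUE)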

3.3 Forward algorithm and backward algorithm

3.3.1 Setup

3.3.1.1 Prepare the model

library(TMB)
source("functions/utils.R")
load("data/fetal-lamb.RData")
TMB::compile("code/poi_hmm.cpp")
dyn.load(dynlib("code/poi_hmm"))
lamb_data <- lamb
m <- 2
TMB_data <- list(x = lamb_data, m = m)
lambda <- c(1, 3)
gamma <- matrix(c(0.8, 0.2,
                  0.2, 0.8), byrow = TRUE, nrow = m)
parameters <- pois.HMM.pn2pw(m, lambda, gamma)
obj_tmb <- MakeADFun(TMB_data, parameters,
                     DLL = "poi_hmm", silent = TRUE)
mod_tmb <- nlminb(start = obj_tmb$par, objective = obj_tmb$fn,
                  gradient = obj_tmb$gr, hessian = obj_tmb$he)
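
Although not needed for the forward and backward algorithms below, estimates and standard errors of the quantities declared with ADREPORT can already be obtained at this stage through TMB's sdreport function, for instance:

# Estimates and standard errors (via the delta method)
# of the ADREPORTed quantities lambda, gamma, and delta
summary(sdreport(obj_tmb), "report")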

3.3.1.2 Define variables

Given an optimized MakeADFun object (obj_tmb above), we need to set up some variables to compute the probabilities detailed below.

# Retrieve the objects at ML value
adrep <- obj_tmb$report(obj_tmb$env$last.par.best)
delta <- adrep$delta
gamma <- adrep$gamma
emission_probs <- adrep$emission_probs
n <- adrep$n
m <- length(delta)
mllk <- adrep$mllk

Note that \({\boldsymbol\lambda}\) is not needed, as we already have access to the emission probabilities.

3.3.1.3 Emission/output probabilities

Emission probabilities (also called output probabilities) are the conditional probabilities of the data given a state.

Using the notation of Section 3.1 in the article, emission probabilities are the conditional distributions \(p_i(x) = \text{P}(X_t = x \vert C_t = i) = \frac{e^{-\lambda_i} \lambda_i^x}{x!}\). We store all of these in a matrix where each row corresponds to an observation and each column to a state.

More formally, we call \(\begin{pmatrix} p_1(x_1) & p_2(x_1) & \ldots & p_m(x_1)\\ p_1(x_2) & p_2(x_2) & \ldots & p_m(x_2)\\ \vdots & \vdots & \ddots & \vdots\\ p_1(x_n) & p_2(x_n) & \ldots & p_m(x_n)\\ \end{pmatrix}\) the emission probability matrix.

We compute these probabilities in C++ rather than in R because it is faster and no more complicated to write.

  // Evaluate conditional distribution: Put conditional
  // probabilities of observed x in n times m matrix
  // (one column for each state, one row for each datapoint):
  matrix<Type> emission_probs(n, m);
  matrix<Type> row1vec(1, m);
  row1vec.setOnes();
  for (int i = 0; i < n; i++) {
    if (x[i] != x[i]) { // f != f returns true if and only if f is NaN. 
      // Replace missing values (NA in R, NaN in C++) with 1
      emission_probs.row(i) = row1vec;
    }
    else {
      emission_probs.row(i) = dpois(x[i], lambda, false);
    }
  }

Nevertheless, we also need to compute them in R for forecasting.

# Calculate emission probabilities
get.emission.probs <- function(data, lambda) {
  n <- length(data)
  m <- length(lambda)
  emission_probs <- matrix(0, nrow = n, ncol = m)
  for (i in 1:n) {
    if (is.na(data[i])) {
      emission_probs[i, ] <- rep(1, m)
    } else {
      emission_probs[i, ] <- dpois(data[i], lambda)
    }
  }
  return(emission_probs)
}
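
As a sanity check, the R version should reproduce the emission probability matrix reported by the C++ template; assuming the objects defined in the setup above are in scope:

# Both versions rely on dpois, so the matrices should agree
all.equal(get.emission.probs(lamb_data, adrep$lambda), emission_probs)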

3.3.2 Log-forward probabilities

We show here a way to compute the log of the forward probabilities, using a scaling scheme defined by Zucchini, MacDonald, and Langrock (2016, pp. 66 and 334).

# Compute log-forward probabilities (scaling used)
lalpha <- matrix(NA, m, n)
foo <- delta * emission_probs[1, ]
sumfoo <- sum(foo)
lscale <- log(sumfoo)
foo <- foo / sumfoo
lalpha[, 1] <- log(foo) + lscale
for (i in 2:n) {
  foo <- foo %*% gamma * emission_probs[i, ]
  sumfoo <- sum(foo)
  lscale <- lscale + log(sumfoo)
  foo <- foo / sumfoo
  lalpha[, i] <- log(foo) + lscale
}
# lalpha contains n=240 columns, so we only display 5 for readability
lalpha[, 1:5]
##            [,1]       [,2]       [,3]      [,4]      [,5]
## [1,] -0.2920639 -0.5591179 -0.8265948 -1.094088 -1.361583
## [2,] -6.4651986 -7.7716769 -8.1146145 -8.385232 -8.652850
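
The last column of lalpha contains \(\log \alpha_n(i)\), and the likelihood is \(L = \sum_i \alpha_n(i)\), so the log-likelihood can be recovered with a numerically stable log-sum-exp. It should match the negative of the mllk value defined earlier.

# Recover the log-likelihood from the log-forward probabilities
cmax <- max(lalpha[, n])
cmax + log(sum(exp(lalpha[, n] - cmax)))
- mllk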

3.3.3 Log-backward probabilities

We show here a way to compute the log of the backward probabilities with a scaling scheme defined by Zucchini, MacDonald, and Langrock (2016, pp. 67 and 334).

# Compute log-backwards probabilities (scaling used)
lbeta <- matrix(NA, m, n)
lbeta[, n] <- rep(0, m)
foo <- rep(1 / m, m)
lscale <- log(m)
for (i in (n - 1):1) {
  foo <- gamma %*% (emission_probs[i + 1, ] * foo)
  lbeta[, i] <- log(foo) + lscale
  sumfoo <- sum(foo)
  foo <- foo / sumfoo
  lscale <- lscale + log(sumfoo)
}
# lbeta contains n=240 columns, so we only display 4 for readability
lbeta[, 1:4]
##           [,1]      [,2]      [,3]      [,4]
## [1,] -177.2275 -176.9600 -176.6925 -176.4250
## [2,] -178.3456 -178.0781 -177.8099 -177.5253
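
Forward and backward probabilities tie together through \(L = \sum_i \alpha_t(i) \beta_t(i)\), which holds for every \(t\). This provides a quick consistency check on the log scale, here at an arbitrary time point:

# log of sum_i alpha_t(i) * beta_t(i) should equal -mllk for any t
tp <- 100
la <- lalpha[, tp] + lbeta[, tp]
cmax <- max(la)
cmax + log(sum(exp(la - cmax)))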

We refer to the State Inference section for an application of this code to state inference and forecasting.

3.4 Reparameterization of the likelihood function

3.4.1 Utility functions in TMB

Defining the negative log-likelihood function requires transforming the working parameters into their natural form. We define the function gamma_w2n to perform this transformation.

The function stat_dist computes the stationary distribution, while delta_w2n can be used if a non-stationary initial distribution is provided instead.

They are defined in functions/utils.cpp.

// Function transforming working parameters in initial distribution
// to natural parameters
template<class Type>
vector<Type> delta_w2n(int m, vector<Type> tdelta) {

  vector<Type> delta(m);
  vector<Type> foo(m);
  
  if (m == 1)
    return Type(1);
  
  // set first element to one.
  // Fill in the last m - 1 elements with working parameters
  // and take exponential
  foo << Type(1), tdelta.exp();

  // normalize
  delta = foo / foo.sum();

  return delta;
}

// Function transforming the working parameters in TPM to
// natural parameters (w2n)
template<class Type>
matrix<Type> gamma_w2n(int m, vector<Type> tgamma) {

  // Construct m x m identity matrix
  matrix<Type> gamma(m, m);
  gamma.setIdentity();
  
  if (m == 1)
    return gamma;

  // Fill offdiagonal elements with working parameters column-wise:
  int idx = 0;
  for (int i = 0; i < m; i++){
    for (int j = 0; j < m; j++){
      if (j != i){
        // Fill gamma according to mapping and take exponential
        gamma(j, i) = tgamma.exp()(idx);
        idx++;
      }
    }
  }
  // Normalize each row:
  vector<Type> cs = gamma.rowwise().sum();
  for (int i = 0; i < m; i++) gamma.row(i) /= cs[i];

  return gamma;
}

// Function computing the stationary distribution of a Markov chain
template<class Type>
vector<Type> stat_dist(int m, matrix<Type> gamma) {
  
  // Construct stationary distribution
  matrix<Type> I(m, m);
  matrix<Type> U(m, m);
  matrix<Type> row1vec(1, m);
  U = U.setOnes();
  I = I.setIdentity();
  row1vec.setOnes();
  matrix<Type> A = I - gamma + U;
  matrix<Type> Ainv = A.inverse();
  matrix<Type> deltamat = row1vec * Ainv;
  vector<Type> delta = deltamat.row(0);
  
  return delta;
}

Transforming the Poisson means into their natural form can be done simply with the exp function and doesn’t require a dedicated function.

3.4.2 Utility functions in R

While TMB requires functions to transform working parameters to their natural form, pre-processing in R requires the inverse transformation.

Functions to achieve this are available in Zucchini, MacDonald, and Langrock (2016, p. 51) and are displayed here with explanatory comments for convenience.

# Transform Poisson natural parameters to working parameters
pois.HMM.pn2pw <- function(m, lambda, gamma, delta = NULL,
                           stationary = TRUE) {
  tlambda <- log(lambda)
  foo <- log(gamma / diag(gamma))
  tgamma <- as.vector(foo[!diag(m)])
  if (stationary) {
    # If tdelta is set to NULL and returned in the list,
    # it will cause issues when optimizing with TMB
    return(list(tlambda = tlambda, tgamma = tgamma))
  } else {
    tdelta <- log(delta[- 1] / delta[1])
    # TMB requires a list
    return(list(tlambda = tlambda, tgamma = tgamma, tdelta = tdelta))
  }
}

This can be broken down into sub-functions if necessary. For the \({\boldsymbol\Gamma}\) part, we introduce gamma.n2w below.

# Function to transform natural parameters to working ones
gamma.n2w <- function(m, gamma) {
  foo <- log(gamma / diag(gamma))
  tgamma <- as.vector(foo[!diag(m)])
  return(tgamma)
}
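
To verify that gamma.n2w inverts the C++ gamma_w2n, here is a minimal R sketch of the latter; the helper gamma.w2n is ours, written only for this round-trip check, and is not part of the utility files.

# Hypothetical R analogue of the C++ gamma_w2n
gamma.w2n <- function(m, tgamma) {
  gamma <- diag(m)
  # !diag(m) indexes the off-diagonal entries column-wise,
  # matching the column-wise filling in the C++ version
  gamma[!diag(m)] <- exp(tgamma)
  gamma / rowSums(gamma) # normalize each row
}
gamma <- matrix(c(0.8, 0.1, 0.1,
                  0.2, 0.7, 0.1,
                  0.1, 0.2, 0.7), byrow = TRUE, nrow = 3)
all.equal(gamma.w2n(3, gamma.n2w(3, gamma)), gamma)
## [1] TRUE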

In the case where a non-stationary distribution is specified, transforming \({\boldsymbol\delta}\) is necessary. For this we use the delta.n2w function.

# Function to transform natural parameters to working ones
delta.n2w <- function(m, delta) {
  tdelta <- log(delta[- 1] / delta[1])
  return(tdelta) 
}
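
Its counterpart delta.w2n, mirroring the C++ delta_w2n, allows the same round-trip check; again, the R helper is ours and serves only as an illustration.

# Hypothetical R analogue of the C++ delta_w2n
delta.w2n <- function(m, tdelta) {
  foo <- c(1, exp(tdelta)) # first element fixed to 1
  foo / sum(foo)           # normalize
}
delta <- c(0.5, 0.3, 0.2)
all.equal(delta.w2n(3, delta.n2w(3, delta)), delta)
## [1] TRUE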

When a stationary distribution is assumed, it can be computed with the following stat.dist function.

# Compute the stationary distribution of a Markov chain
# with transition probability gamma
stat.dist <- function(gamma) {
  # The eigendecomposition approach of Zucchini et al. can crash when
  # dealing with computationally small numbers, likely an approximation
  # error. The result may then be of complex type with an imaginary part
  # equal to 0. We can safely drop the imaginary part, because the
  # eigenvector of a real eigenvalue of a real matrix can be taken real.
  # In the cases where this happened, solve(t(diag(m) - gamma + 1), rep(1, m))
  # produced the same result without the imaginary part.
  first_eigen_row <- Re(eigen(t(gamma))$vectors[, 1])
  return(first_eigen_row / sum(first_eigen_row))
}
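
For instance, the symmetric two-state TPM used in the setup above has a uniform stationary distribution:

gamma2 <- matrix(c(0.8, 0.2,
                   0.2, 0.8), byrow = TRUE, nrow = 2)
stat.dist(gamma2)
## [1] 0.5 0.5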

Zucchini, MacDonald, and Langrock (2016, p. 51) show that the stationary distribution can be obtained by solving the equation below for \({\boldsymbol\delta}\), where \({\boldsymbol I}_m\) is the \(m \times m\) identity matrix, \({\boldsymbol U}\) is the \(m \times m\) matrix of ones, and \({\boldsymbol 1}\) is a row vector of ones. \[ {\boldsymbol\delta}({\boldsymbol I}_m - {\boldsymbol\Gamma} + {\boldsymbol U}) = {\boldsymbol 1} \]
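
In R, this system can be solved directly, which is the alternative mentioned in the comments of stat.dist above; with the two-state example:

# Solve delta (I - Gamma + U) = 1, transposed so that solve() applies
m <- nrow(gamma2)
solve(t(diag(m) - gamma2 + 1), rep(1, m))
## [1] 0.5 0.5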

3.4.2.1 Label switching

By default, no particular order is imposed on the states during estimation. When estimates from several model fits need to be aggregated, this can lead to the label switching problem, where the states are not ordered the same way in each fit and are therefore grouped incorrectly. The issue is relevant when comparing results across optimizers, initial parameters, or classes of models.

To address this, we reorder the states by ascending Poisson means after estimation. Sorting the means is immediate; re-ordering the TPM, however, is less straightforward. To do so, we take the permutation of the states induced by the sorted Poisson means and map each row and column index to its new value. The function pois.HMM.label.order, presented below, performs this relabeling.

# Relabel states by increasing Poisson means
pois.HMM.label.order <- function(m,
                                 lambda,
                                 gamma,
                                 delta = NULL,
                                 lambda_std_error = NULL,
                                 gamma_std_error = NULL,
                                 delta_std_error = NULL,
                                 smoothing_probs = NULL,
                                 smoothing_probs_std_error = NULL,
                                 ldecode = NULL,
                                 indices = FALSE) {
  if (anyNA(c(m, lambda, gamma))) {
    return(NA)
  }
  
  # Remove vector names (optional, but looks better without redundancy)
  names(lambda) <- NULL
  names(lambda_std_error) <- NULL
  names(delta) <- NULL
  names(delta_std_error) <- NULL
  
  # gamma_vector_indices is used to calculate the indices of the reordered TPM gamma as
  # a vector for reordering the rows of the complete CI data.frame used for the article.
  gamma_vector_indices <- 1:(m ^ 2)
  gamma_vector_matrix <- matrix(gamma_vector_indices, nrow = m, ncol = m)
  ordered_gamma_vector_matrix <- matrix(0, nrow = m, ncol = m)
  
  # Get the indices of the sorted states
  # according to ascending lambda
  # ordered_lambda contains the permutations needed
  ordered_lambda_indices <- order(lambda)
  ordered_lambda <- lambda[ordered_lambda_indices]
  names(ordered_lambda) <- NULL
  # Reorder the TPM according to the switched states
  # in the sorted lambda
  ordered_gamma <- matrix(0, nrow = m, ncol = m)
  for (col in 1:m) {
    new_col <- which(ordered_lambda_indices == col)
    for (row in 1:m) {
      new_row <- which(ordered_lambda_indices == row)
      ordered_gamma[row, col] <- gamma[new_row, new_col]
      
      # Reorder the vector TPM
      ordered_gamma_vector_matrix[row, col] <- gamma_vector_matrix[new_row, new_col]
    }
  }
  
  # Same for the TPM's standard errors
  ordered_gamma_std_error <- NULL
  if (!is.null(gamma_std_error)) {
    ordered_gamma_std_error <- matrix(0, nrow = m, ncol = m)
    for (col in 1:m) {
      new_col <- which(ordered_lambda_indices == col)
      for (row in 1:m) {
        new_row <- which(ordered_lambda_indices == row)
        ordered_gamma_std_error[row, col] <- gamma_std_error[new_row, new_col]
      }
    }
  }
  
  # Reorder the stationary distribution if it is provided
  # Generate it otherwise
  ordered_delta <- if (!is.null(delta)) delta[ordered_lambda_indices] else stat.dist(ordered_gamma)
  
  # Reorder the smoothing probabilities (1 row per state, 1 column per data)
  ordered_smoothing_probs <- if (!is.null(smoothing_probs)) smoothing_probs[ordered_lambda_indices, ] else NULL
  # Reorder the decoded states
  if (!is.null(ldecode)) {
    ldecode <- sapply(ldecode, function(e) {
      # ldecode is NA for missing data
      # Keep NA in that case
      if (is.na(e)) {
        return(NA)
      } else {
        return(which(ordered_lambda_indices == e))
      }
    })
  } else {
    ldecode <- NULL
  }
  
  # Reorder the standard errors
  ordered_lambda_std_error <- lambda_std_error[ordered_lambda_indices]
  ordered_delta_std_error <- delta_std_error[ordered_lambda_indices]
  # 1 row per state, 1 column per data
  ordered_smoothing_probs_std_error <- if (!is.null(smoothing_probs_std_error)) smoothing_probs_std_error[ordered_lambda_indices, ] else NULL
  
  # The vector is assumed to be filled column-wise rather than row-wise,
  # because column-wise is the default way R converts a matrix to a vector.
  # If a row-wise order is needed, replace ordered_gamma_vector_matrix with
  # t(ordered_gamma_vector_matrix) below, or add byrow = TRUE to
  # "ordered_gamma_vector_matrix <- matrix(0, nrow = m, ncol = m)" above.
  ordered_gamma_vector_matrix <- as.numeric(ordered_gamma_vector_matrix)
  
  if (indices == TRUE) {
    result <- list(m = m,
                   lambda = ordered_lambda,
                   gamma = ordered_gamma,
                   delta = ordered_delta,
                   lambda_std_error = ordered_lambda_std_error,
                   gamma_std_error = ordered_gamma_std_error,
                   delta_std_error = ordered_delta_std_error,
                   smoothing_probs = ordered_smoothing_probs,
                   smoothing_probs_std_error = ordered_smoothing_probs_std_error,
                   ldecode = ldecode,
                   ordered_lambda_indices = ordered_lambda_indices,
                   ordered_gamma_vector_indices = ordered_gamma_vector_matrix,
                   # delta and lambda are the same size, so they are ordered the same way
                   ordered_delta_indices = ordered_lambda_indices)
  } else {
    result <- list(m = m,
                   lambda = ordered_lambda,
                   gamma = ordered_gamma,
                   delta = ordered_delta,
                   lambda_std_error = ordered_lambda_std_error,
                   gamma_std_error = ordered_gamma_std_error,
                   delta_std_error = ordered_delta_std_error,
                   smoothing_probs = ordered_smoothing_probs,
                   smoothing_probs_std_error = ordered_smoothing_probs_std_error,
                   ldecode = ldecode)
  }
  
  # Remove the NULL elements
  result[sapply(result, is.null)] <- NULL
  
  return(result)
}

We now go through an example to better understand the process. For readability, the TPM is filled with entries that encode their row and column indices instead of probabilities.

lambda <- c(3, 1, 2)
gamma <- matrix(c(11, 12, 13,
                  21, 22, 23,
                  31, 32, 33), byrow = TRUE, ncol = 3)
pois.HMM.label.order(m = 3, lambda, gamma)
## $m
## [1] 3
## 
## $lambda
## [1] 1 2 3
## 
## $gamma
##      [,1] [,2] [,3]
## [1,]   33   31   32
## [2,]   13   11   12
## [3,]   23   21   22
## 
## $delta
## [1] 0.3482817 0.3183850 0.3333333

State 1 has been relabeled state 3, state 3 became state 2, and state 2 became state 1.

Another way to address this issue is to parameterize in terms of non-negative increments \(\lambda_j - \lambda_{j-1}\) with \(\lambda_0 \equiv 0\), as explained by Zucchini, MacDonald, and Langrock (2016, sec. 7.1.1 p. 112). However, Bulla and Berzel (2008, sec. 3.2 p. 7) show that this can cause optimization issues: “over all series, the simplest parameterization, i.e., the use of log-transformed state-dependent parameters, leads to the best results as regards the number of failures and the convergence to the global maximum”.
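
For completeness, here is a minimal sketch of that increment-based parameterization; the helper names are ours and are not used elsewhere in this tutorial.

# Working parameters as log-increments with lambda_0 = 0;
# this enforces ordered (increasing) Poisson means
lambda.n2w.inc <- function(lambda) log(diff(c(0, sort(lambda))))
lambda.w2n.inc <- function(tlambda) cumsum(exp(tlambda))
lambda.w2n.inc(lambda.n2w.inc(c(1, 3, 7)))
## [1] 1 3 7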

These utility functions are not complicated, but as you can see, they would clutter the main code. We therefore keep them in separate files and call them from the main program.