Iterative linearised INLA method
Finn Lindgren and Man Ho Suen
Source:vignettes/method.Rmd
method.Rmd
The INLA method for linear predictors
The INLA method is used to compute fast approximative posterior distribution for Bayesian generalised additive models. The hierarchical structure of such a model with latent Gaussian components , covariance parameters , and measured response variables , can be written as where typically each linear predictor element, , is linked to a location parameter of the distribution for observation , for each , via a (non-linear) link function . In the R-INLA implementation, the observations are assumed to be conditionally independent, given and .
Approximate INLA for non-linear predictors
The premise for the inlabru method for non-linear predictors is to build on the existing implementation, and only add a linearisation step. The properties of the resulting approximation will depend on the nature of the non-linearity.
Let be a non-linear predictor, i.e.ย a deterministic function of , and let be the 1st order Taylor approximation at , where is the derivative matrix for the non-linear predictor, evaluated at . Hence, we define The non-linear observation model is approximated by replacing the non-linear predictor with its linearisation, so that the linearised model is defined by
The non-linear model posterior is factorised as and the linear model approximation is factorised as
Fixed point iteration
The remaining step of the approximation is how to choose the linearisation point . For a given linearisation point , INLA will compute the posterior mode for , and the joint conditional posterior mode for ,
Define the Bayesian estimation functional1 and let denote the corresponding posterior modes for the true posterior distribution,
The fixed point should ideally be close to , i.e.ย close to the true marginal/conditional posterior mode. We can achieve this for the conditional latent mode, so that .
We therefore seek the latent vector that generates the fixed point of the functional, so that .
One key to the fixed point iteration is that the observation model is linked to only through the non-linear predictor , since this leads to a simplified line search method below.
- Let be an initial linearisation point for the latent variables obtained from the initial INLA call. Iterate the following steps for
- Compute the predictor linearisation at .
- Compute the linearised INLA posterior .
- Let be the initial candidate for new linearisation point.
- Let , and find the value minimises .
- Set the new linearisation point equal to and repeat from step 1, unless the iteration has converged to a given tolerance.
A potential improvement of step 4 might be to also take into account the prior distribution for as a minimisation penalty, to avoid moving further than would be indicated by a full likelihood optimisation.
Line search
In step 4, we would ideally want
to be
However, since this requires access to
the internal likelihood and prior density evaluation code, we instead
use a simpler alternative. We consider norms of the form
that only depend on the nonlinear and linearised predictor expressions,
and other known quantities, given
,
such as the current INLA estimate of the component wise predictor
variances.
Let
be the current estimate of the posterior variance for each predictor
element
.
We then define an inner product on the space of predictor vectors as
The squared norm for the difference
between the predictor vectors
and
,with
respect to this inner product, is defined as
Using this norm as the target loss
function for the line search avoids many potentially expensive
evaluations of the true posterior conditional log-density. We evaluate
and make use of the linearised predictor information. Let
and
.
In other words,
corresponds to the previous linear predictor, and
is the current estimate from INLA. An exact line search would minimise
.
Instead, we define a quadratic approximation to the non-linear predictor
as a function of
,
and minimise the quartic polynomial in
,
If initial expansion and contraction
steps are carried out, leading to an initial guess of
,
where
is a scaling factor (see ?bru_options
,
bru_method$factor
) and
is the (signed) number of expansions and contractions, the quadratic
expression is replaced by
which is minimised on the interval
.
Posterior non-linearity checks
Whereas the inlabru optimisation method leads to an estimate where for a specific , the overall posterior approximation accuracy depends on the degree of nonlinearity in the vicinity of . There are two main options for evaluating this nonlinearity, using sampling from the approximate posterior distribution. The first option is which is the posterior expectation of the component-wise variance-normalised squared deviation between the non-linear and linearised predictor. Note that the normalising variance includes the variability induced by the posterior uncertainty for , whereas the norm used for the line search used only the posterior mode. Another option is which is the KullbackโLeibler divergence for the conditional posterior densities, , integrated over the approximate posterior distribution for . Implementing this would require access to the likelihood and prior distribution details. The next section explores this in more detail.
Accuracy
We wish to assess how accurate the approximation is. Thus, we compare and . With Bayesโ theorem, where is a Gaussian density and is a scaling factor that doesnโt depend on . We can therefore focus on the behaviour of for the exact and linearised observation models.
Recall that the observation likelihood only depends on through . Using a Taylor expansion with respect to and , Similarly, for each component of ,
where is the Hessian with respect to , are linear in , and are quadratic in . Combining the two expansions and taking the difference between the full and linearised log-likelihoods, we get Note that the log-likelihood Hessian difference contribution only involves third order terms and higher, so the expression above includes all terms up to second order.
Let
and
and form the sum of their products,
.
Then
With
and
,
we obtain
For each
configuration in the INLA output, we can extract both
and the sparse precision matrix
for the Gaussian approximation. The non-sparsity structure of
is contained in the non-sparsity of
,
which allows the use of Takahashi recursion (inla.qinv(Q)
)
to compute the corresponding
values needed to evaluate the trace
.
Thus, to implement a numerical approximation of this error analysis only
needs special access to the log-likelihood derivatives
,
as
can in principle be evaluated numerically.
For a given , The first term was approximated above. The second term can also be approximated using the derived quantities (to be continuedโฆ).
Summary: The form of the observation likelihood discrepancy shows that, given a linearised posterior , a Gaussian approximation to the nonlinear model posterior, , can be obtained from and . The K-L divergence becomes When the INLA posterior has mean , e.g.ย for models with additive Gaussian observation noise, and , the last term vanishes.
Note: by implementing the K-L divergence accuracy metric, a by-product would be improved posterior estimates based on and .
Well-posedness and initialisation
On a side note, one might be concerned about initialisation at, or convergence to, a saddle point. Although it is not implemented in inlabru, we want to talk about the technicality how we define the initial linearisation point .
Generally speaking, any values of
work except the case that the gradient evaluated at
is
because the linearisation point will never move away if the prior mean
is also
.
In general, this tends to be a saddle point problem. In some cases the
problem can be handled by changing the predictor parameterisation or
just changing the initialisation point using the
bru_initial
option. However, for true saddle point
problems, it indicates that the predictor parameterisation may lead to a
multimodal posterior distribution or is ill-posed in some other way.
This is a more fundamental problem that cannot be fixed by changing the
initialisation point.
In these examples, where and are latent Gaussian components, the predictors 1, 3, and 4 would typically be safe, but predictor 2 is fundamentally non-identifiable. Note that for and , the partial derivatives with respect to are zero for . However, the first inlabru iteration will give a non-zero estimate of , so that subsequent iteration will involve both and .