
The INLA method for linear predictors

The INLA method is used to compute fast approximate posterior distributions for Bayesian generalised additive models. The hierarchical structure of such a model, with latent Gaussian components $\boldsymbol{u}$, covariance parameters $\boldsymbol{\theta}$, and measured response variables $\boldsymbol{y}$, can be written as
$$
\begin{aligned}
\boldsymbol{\theta} &\sim p(\boldsymbol{\theta}) \\
\boldsymbol{u}|\boldsymbol{\theta} &\sim \mathcal{N}\!\left(\boldsymbol{\mu}_u, \boldsymbol{Q}(\boldsymbol{\theta})^{-1}\right) \\
\boldsymbol{\eta}(\boldsymbol{u}) &= \boldsymbol{A}\boldsymbol{u} \\
\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta} &\sim p(\boldsymbol{y}|\boldsymbol{\eta}(\boldsymbol{u}),\boldsymbol{\theta})
\end{aligned}
$$
where typically each linear predictor element, $\eta_i(\boldsymbol{u})$, is linked to a location parameter of the distribution for observation $y_i$, for each $i$, via a (non-linear) link function $g^{-1}(\cdot)$. In the R-INLA implementation, the observations are assumed to be conditionally independent, given $\boldsymbol{\eta}$ and $\boldsymbol{\theta}$.
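For concreteness, a model of this form can be specified directly in R-INLA when the predictor is linear in $\boldsymbol{u}$. The sketch below is not taken from the text; the data frame and component names are hypothetical. It fits a Poisson model whose linear predictor combines an intercept, a fixed effect, and an iid random effect, which is the setting the rest of this article generalises to non-linear predictors.

```r
# Minimal sketch of a linear-predictor latent Gaussian model in R-INLA.
# Data and component names are hypothetical illustrations.
library(INLA)

set.seed(1)
df <- data.frame(
  y = rpois(100, lambda = 5),
  x = rnorm(100),
  idx = 1:100
)

# eta_i = beta_0 + beta_1 * x_i + u_{idx[i]}, i.e. eta = A u with A built by INLA
fit <- inla(
  y ~ 1 + x + f(idx, model = "iid"),
  family = "poisson",
  data = df
)
summary(fit)
```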

Approximate INLA for non-linear predictors

The premise for the inlabru method for non-linear predictors is to build on the existing implementation, and only add a linearisation step. The properties of the resulting approximation will depend on the nature of the non-linearity.

Let $\widetilde{\boldsymbol{\eta}}(\boldsymbol{u})$ be a non-linear predictor, i.e. a deterministic function of $\boldsymbol{u}$,
$$
\widetilde{\boldsymbol{\eta}}(\boldsymbol{u}) = \textsf{fcn}(\boldsymbol{u}),
$$
and let $\overline{\boldsymbol{\eta}}(\boldsymbol{u})$ be the 1st order Taylor approximation at $\boldsymbol{u}_0$,
$$
\overline{\boldsymbol{\eta}}(\boldsymbol{u}) = \widetilde{\boldsymbol{\eta}}(\boldsymbol{u}_0) + \boldsymbol{B}(\boldsymbol{u} - \boldsymbol{u}_0) = \left[\widetilde{\boldsymbol{\eta}}(\boldsymbol{u}_0) - \boldsymbol{B}\boldsymbol{u}_0\right] + \boldsymbol{B}\boldsymbol{u},
$$
where $\boldsymbol{B}$ is the derivative matrix for the non-linear predictor, evaluated at $\boldsymbol{u}_0$. Hence, we define
$$
\begin{aligned}
\boldsymbol{y} | \boldsymbol{u}, \boldsymbol{\theta} &\overset{d}{=} \boldsymbol{y} | \widetilde{\boldsymbol{\eta}}(\boldsymbol{u}), \boldsymbol{\theta} \\
&\sim p(\boldsymbol{y} | g^{-1}[\widetilde{\boldsymbol{\eta}}(\boldsymbol{u})], \boldsymbol{\theta}).
\end{aligned}
$$
The non-linear observation model $p(\boldsymbol{y}|g^{-1}[\widetilde{\boldsymbol{\eta}}(\boldsymbol{u})],\boldsymbol{\theta})$ is approximated by replacing the non-linear predictor with its linearisation, so that the linearised model is defined by
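When the predictor is only available as an R expression, the derivative matrix $\boldsymbol{B}$ can be obtained by numerical differentiation. Below is a minimal sketch, assuming a user-supplied predictor function eta_fun (a hypothetical stand-in for the model's predictor expression); it builds the linearisation $\overline{\boldsymbol{\eta}}$ around a point u0 using the numDeriv package. inlabru's internal implementation differs in detail.

```r
# Sketch: first-order Taylor linearisation of a non-linear predictor.
# eta_fun and u0 are hypothetical; inlabru computes this internally.
library(numDeriv)

linearise_predictor <- function(eta_fun, u0) {
  eta0 <- eta_fun(u0)                  # predictor value at the linearisation point
  B <- numDeriv::jacobian(eta_fun, u0) # derivative matrix evaluated at u0
  offset <- eta0 - B %*% u0            # constant part of the linearisation
  list(
    eta_bar = function(u) as.vector(offset + B %*% u),  # linearised predictor
    B = B,
    offset = as.vector(offset)
  )
}

# Toy example with a simple non-linear predictor eta(u) = exp(u1) * c(1, 2, 3) + u2
eta_fun <- function(u) exp(u[1]) * c(1, 2, 3) + u[2]
lin <- linearise_predictor(eta_fun, u0 = c(0, 0))
lin$eta_bar(c(0.1, -0.2))
```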

$$
\overline{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta})
= p(\boldsymbol{y}|\overline{\boldsymbol{\eta}}(\boldsymbol{u}),\boldsymbol{\theta})
= p(\boldsymbol{y}|g^{-1}[\overline{\boldsymbol{\eta}}(\boldsymbol{u})],\boldsymbol{\theta})
\approx p(\boldsymbol{y}|g^{-1}[\widetilde{\boldsymbol{\eta}}(\boldsymbol{u})],\boldsymbol{\theta})
= p(\boldsymbol{y}|\widetilde{\boldsymbol{\eta}}(\boldsymbol{u}),\boldsymbol{\theta})
= \widetilde{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta}).
$$
The non-linear model posterior is factorised as
$$
\widetilde{p}(\boldsymbol{\theta},\boldsymbol{u}|\boldsymbol{y}) = \widetilde{p}(\boldsymbol{\theta}|\boldsymbol{y})\,\widetilde{p}(\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta}),
$$
and the linear model approximation is factorised as
$$
\overline{p}(\boldsymbol{\theta},\boldsymbol{u}|\boldsymbol{y}) = \overline{p}(\boldsymbol{\theta}|\boldsymbol{y})\,\overline{p}(\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta}).
$$

Fixed point iteration

The remaining step of the approximation is how to choose the linearisation point $\boldsymbol{u}_*$. For a given linearisation point $\boldsymbol{v}$, INLA will compute the posterior mode for $\boldsymbol{\theta}$,
$$
\widehat{\boldsymbol{\theta}}_{\boldsymbol{v}} = \mathop{\mathrm{arg\,max}}_{\boldsymbol{\theta}} \overline{p}_{\boldsymbol{v}}(\boldsymbol{\theta} | \boldsymbol{y}),
$$
and the joint conditional posterior mode for $\boldsymbol{u}$,
$$
\widehat{\boldsymbol{u}}_{\boldsymbol{v}} = \mathop{\mathrm{arg\,max}}_{\boldsymbol{u}} \overline{p}_{\boldsymbol{v}}(\boldsymbol{u} | \boldsymbol{y}, \widehat{\boldsymbol{\theta}}_{\boldsymbol{v}}).
$$

Define the Bayesian estimation functional¹
$$
f(\overline{p}_{\boldsymbol{v}}) = (\widehat{\boldsymbol{\theta}}_{\boldsymbol{v}}, \widehat{\boldsymbol{u}}_{\boldsymbol{v}}),
$$
and let $f(p)=(\widehat{\boldsymbol{\theta}},\widehat{\boldsymbol{u}})$ denote the corresponding posterior modes for the true posterior distribution,
$$
\begin{aligned}
\widehat{\boldsymbol{\theta}} &= \mathop{\mathrm{arg\,max}}_{\boldsymbol{\theta}} p(\boldsymbol{\theta} | \boldsymbol{y}), \\
\widehat{\boldsymbol{u}} &= \mathop{\mathrm{arg\,max}}_{\boldsymbol{u}} p(\boldsymbol{u} | \boldsymbol{y}, \widehat{\boldsymbol{\theta}}).
\end{aligned}
$$

The fixed point $(\boldsymbol{\theta}_*,\boldsymbol{u}_*)=f(\overline{p}_{\boldsymbol{u}_*})$ should ideally be close to $(\widehat{\boldsymbol{\theta}},\widehat{\boldsymbol{u}})$, i.e. close to the true marginal/conditional posterior mode. We can achieve this for the conditional latent mode, so that $\boldsymbol{u}_* = \mathop{\mathrm{arg\,max}}_{\boldsymbol{u}} p(\boldsymbol{u} | \boldsymbol{y}, \widehat{\boldsymbol{\theta}}_{\boldsymbol{u}_*})$.

We therefore seek the latent vector $\boldsymbol{u}_*$ that generates the fixed point of the functional, so that $(\boldsymbol{\theta}_*,\boldsymbol{u}_*)=f(\overline{p}_{\boldsymbol{u}_*})$.

One key to the fixed point iteration is that the observation model is linked to $\boldsymbol{u}$ only through the non-linear predictor $\widetilde{\boldsymbol{\eta}}(\boldsymbol{u})$, since this leads to a simplified line search method below.

  1. Let $\boldsymbol{u}_0$ be an initial linearisation point for the latent variables, obtained from the initial INLA call. Iterate the following steps for $k=0,1,2,\ldots$
  2. Compute the predictor linearisation at $\boldsymbol{u}_0$.
  3. Compute the linearised INLA posterior $\overline{p}_{\boldsymbol{u}_0}(\boldsymbol{\theta}|\boldsymbol{y})$.
  4. Let $(\boldsymbol{\theta}_1,\boldsymbol{u}_1)=(\widehat{\boldsymbol{\theta}}_{\boldsymbol{u}_0},\widehat{\boldsymbol{u}}_{\boldsymbol{u}_0})=f(\overline{p}_{\boldsymbol{u}_0})$ be the initial candidate for the new linearisation point.
  5. Let $\boldsymbol{v}_\alpha=(1-\alpha)\boldsymbol{u}_0+\alpha\boldsymbol{u}_1$, and find the value $\alpha$ that minimises $\|\widetilde{\eta}(\boldsymbol{v}_\alpha)-\overline{\eta}(\boldsymbol{u}_1)\|$.
  6. Set the new linearisation point $\boldsymbol{u}_0$ equal to $\boldsymbol{v}_\alpha$ and repeat from step 2, unless the iteration has converged to a given tolerance (a schematic sketch of this loop is given after the list).
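The iteration can be summarised as a short driver loop. The sketch below is a schematic rendering of the steps above, assuming hypothetical helper functions linearise_predictor() (as sketched earlier), fit_linearised_inla() for the linearised INLA call, and line_search_alpha() for step 5; none of these names are part of the inlabru API.

```r
# Schematic fixed-point iteration (hypothetical helpers, not inlabru internals).
fixed_point_iteration <- function(eta_fun, u0, max_iter = 20, tol = 1e-6) {
  for (k in seq_len(max_iter)) {
    lin <- linearise_predictor(eta_fun, u0)           # step 2: linearise at u0
    fit <- fit_linearised_inla(lin)                   # step 3: linearised INLA posterior
    u1 <- fit$u_mode                                  # step 4: candidate (theta_1, u_1)
    alpha <- line_search_alpha(eta_fun, lin, u0, u1)  # step 5: line search
    v_alpha <- (1 - alpha) * u0 + alpha * u1
    if (max(abs(v_alpha - u0)) < tol) {               # step 6: convergence check
      return(list(u = v_alpha, fit = fit))
    }
    u0 <- v_alpha                                     # new linearisation point
  }
  warning("Fixed-point iteration did not converge within max_iter steps")
  list(u = u0, fit = fit)
}
```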

A potential improvement of step 5 might be to also take into account the prior distribution for $\boldsymbol{u}$ as a minimisation penalty, to avoid moving further than would be indicated by a full likelihood optimisation.

In step 5, we would ideally want $\alpha$ to be
$$
\mathop{\mathrm{arg\,max}}_{\alpha} \left[\ln p(\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta}_1)\right]_{\boldsymbol{u}=\boldsymbol{v}_\alpha}.
$$
However, since this requires access to the internal likelihood and prior density evaluation code, we instead use a simpler alternative. We consider norms of the form $\|\widetilde{\eta}(\boldsymbol{v}_\alpha)-\overline{\eta}(\boldsymbol{u}_1)\|$ that only depend on the non-linear and linearised predictor expressions, and other known quantities given $\boldsymbol{u}_0$, such as the current INLA estimate of the component-wise predictor variances.

Let $\sigma_i^2 = \mathrm{Var}_{\boldsymbol{u}\sim \overline{p}(\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta}_1)}(\overline{\boldsymbol{\eta}}_i(\boldsymbol{u}))$ be the current estimate of the posterior variance for each predictor element $i$. We then define an inner product on the space of predictor vectors as
$$
\langle \boldsymbol{a},\boldsymbol{b} \rangle_V = \sum_i \frac{a_i b_i}{\sigma_i^2}.
$$
The squared norm for the difference between the predictor vectors $\widetilde{\boldsymbol{\eta}}(\boldsymbol{v}_\alpha)$ and $\overline{\boldsymbol{\eta}}(\boldsymbol{u}_1)$, with respect to this inner product, is defined as
$$
\| \widetilde{\boldsymbol{\eta}}(\boldsymbol{v}_\alpha) - \overline{\boldsymbol{\eta}}(\boldsymbol{u}_1)\|^2_V = \sum_i \frac{|\widetilde{\boldsymbol{\eta}}_i(\boldsymbol{v}_\alpha)-\overline{\boldsymbol{\eta}}_i(\boldsymbol{u}_1)|^2}{\sigma_i^2}.
$$
Using this norm as the target loss function for the line search avoids many potentially expensive evaluations of the true posterior conditional log-density. We evaluate $\widetilde{\boldsymbol{\eta}}_1=\widetilde{\boldsymbol{\eta}}(\boldsymbol{u}_1)$ and make use of the linearised predictor information. Let $\widetilde{\boldsymbol{\eta}}_\alpha=\widetilde{\boldsymbol{\eta}}(\boldsymbol{v}_\alpha)$ and $\overline{\boldsymbol{\eta}}_\alpha=\overline{\boldsymbol{\eta}}(\boldsymbol{v}_\alpha)=(1-\alpha)\widetilde{\boldsymbol{\eta}}(\boldsymbol{u}_0)+\alpha\overline{\boldsymbol{\eta}}(\boldsymbol{u}_1)$. In other words, $\alpha=0$ corresponds to the previous linear predictor, and $\alpha=1$ is the current estimate from INLA. An exact line search would minimise $\|\widetilde{\boldsymbol{\eta}}_\alpha-\overline{\boldsymbol{\eta}}_1\|$. Instead, we define a quadratic approximation to the non-linear predictor as a function of $\alpha$,
$$
\breve{\boldsymbol{\eta}}_\alpha = \overline{\boldsymbol{\eta}}_\alpha + \alpha^2 (\widetilde{\boldsymbol{\eta}}_1 - \overline{\boldsymbol{\eta}}_1),
$$
and minimise the quartic polynomial in $\alpha$,
$$
\|\breve{\boldsymbol{\eta}}_\alpha-\overline{\boldsymbol{\eta}}_1\|^2 = \| (\alpha-1)(\overline{\boldsymbol{\eta}}_1 - \overline{\boldsymbol{\eta}}_0) + \alpha^2 (\widetilde{\boldsymbol{\eta}}_1 - \overline{\boldsymbol{\eta}}_1) \|^2.
$$
If initial expansion and contraction steps are carried out, leading to an initial guess of $\alpha=\gamma^k$, where $\gamma>1$ is a scaling factor (see ?bru_options, bru_method$factor) and $k$ is the (signed) number of expansions and contractions, the quadratic expression is replaced by
$$
\|\breve{\boldsymbol{\eta}}_\alpha-\overline{\boldsymbol{\eta}}_1\|^2 = \left\| (\alpha-1)(\overline{\boldsymbol{\eta}}_1 - \overline{\boldsymbol{\eta}}_0) + \frac{\alpha^2}{\gamma^{2k}} (\widetilde{\boldsymbol{\eta}}_{\gamma^k} - \overline{\boldsymbol{\eta}}_{\gamma^k}) \right\|^2,
$$
which is minimised on the interval $\alpha\in[\gamma^{k-1},\gamma^{k+1}]$.
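The quartic minimisation in the weighted norm is cheap to carry out numerically. The sketch below assumes that the predictor values $\widetilde{\boldsymbol{\eta}}(\boldsymbol{u}_0)$, $\overline{\boldsymbol{\eta}}(\boldsymbol{u}_1)$, $\widetilde{\boldsymbol{\eta}}(\boldsymbol{u}_1)$ and the variance estimates $\sigma_i^2$ are available as plain vectors; the variable names are illustrative only and do not match inlabru's internal code.

```r
# Sketch: line search for alpha by minimising the quartic approximation
# || (alpha - 1)(eta_bar1 - eta_bar0) + alpha^2 (eta_tilde1 - eta_bar1) ||_V^2,
# where eta_bar0 = eta_tilde(u0), eta_bar1 = eta_bar(u1), eta_tilde1 = eta_tilde(u1),
# and sigma2 holds the component-wise predictor variances (hypothetical inputs).
line_search_alpha <- function(eta_bar0, eta_bar1, eta_tilde1, sigma2,
                              lower = 0, upper = 2) {
  quartic <- function(alpha) {
    d <- (alpha - 1) * (eta_bar1 - eta_bar0) + alpha^2 * (eta_tilde1 - eta_bar1)
    sum(d^2 / sigma2)   # squared || . ||_V norm
  }
  optimise(quartic, interval = c(lower, upper))$minimum
}

# Toy usage with made-up predictor vectors
line_search_alpha(
  eta_bar0   = c(0.0, 0.1, -0.2),
  eta_bar1   = c(0.5, 0.4,  0.3),
  eta_tilde1 = c(0.45, 0.38, 0.33),
  sigma2     = c(1, 1, 1)
)
```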

Posterior non-linearity checks

Whereas the inlabru optimisation method leads to an estimate where $\| \widetilde{\boldsymbol{\eta}}(\boldsymbol{u}_*) - \overline{\boldsymbol{\eta}}(\boldsymbol{u}_*)\|=0$ for a specific $\boldsymbol{u}_*$, the overall posterior approximation accuracy depends on the degree of nonlinearity in the vicinity of $\boldsymbol{u}_*$. There are two main options for evaluating this nonlinearity, using sampling from the approximate posterior distribution. The first option is
$$
\sum_i \frac{E_{\boldsymbol{u}\sim \overline{p}(\boldsymbol{u}|\boldsymbol{y})}\left[ |\overline{\boldsymbol{\eta}}_i(\boldsymbol{u})-\widetilde{\boldsymbol{\eta}}_i(\boldsymbol{u})|^2\right]}{\mathrm{Var}_{\boldsymbol{u}\sim \overline{p}(\boldsymbol{u}|\boldsymbol{y})}(\overline{\boldsymbol{\eta}}_i(\boldsymbol{u}))},
$$
which is the posterior expectation of the component-wise variance-normalised squared deviation between the non-linear and linearised predictor. Note that the normalising variance includes the variability induced by the posterior uncertainty for $\boldsymbol{\theta}$, whereas the $\|\cdot\|_V$ norm used for the line search used only the posterior mode. Another option is
$$
E_{(\boldsymbol{u},\boldsymbol{\theta})\sim \overline{p}(\boldsymbol{u},\boldsymbol{\theta}|\boldsymbol{y})} \left[\ln \frac{\overline{p}(\boldsymbol{u} |\boldsymbol{y},\boldsymbol{\theta})}{\widetilde{p}(\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta})}\right]
= E_{\boldsymbol{\theta}\sim \overline{p}(\boldsymbol{\theta}|\boldsymbol{y})} \left\{ E_{\boldsymbol{u}\sim \overline{p}(\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta})} \left[\ln \frac{\overline{p}(\boldsymbol{u} |\boldsymbol{y},\boldsymbol{\theta})}{\widetilde{p}(\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta})}\right] \right\},
$$
which is the Kullback–Leibler divergence for the conditional posterior densities, $\mathsf{KL}\left(\overline{p}\,\middle\|\,\widetilde{p}\right)$, integrated over the approximate posterior distribution for $\boldsymbol{\theta}$. Implementing this would require access to the likelihood and prior distribution details. The next section explores this in more detail.
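The first metric can be estimated by Monte Carlo from samples of the linearised posterior. A minimal sketch, assuming a matrix u_samples of latent posterior samples (for instance drawn via inla.posterior.sample() or inlabru::generate()) and hypothetical functions eta_tilde() and eta_bar() that evaluate the two predictors for a latent vector:

```r
# Sketch: Monte Carlo estimate of the variance-normalised squared deviation
# between the non-linear and linearised predictors.
# u_samples: matrix with one posterior sample of the latent vector per row,
# eta_tilde(), eta_bar(): hypothetical predictor evaluation functions.
nonlinearity_metric <- function(u_samples, eta_tilde, eta_bar) {
  eta_t <- t(apply(u_samples, 1, eta_tilde))  # samples of the non-linear predictor
  eta_b <- t(apply(u_samples, 1, eta_bar))    # samples of the linearised predictor
  sq_dev <- colMeans((eta_b - eta_t)^2)       # E[ |eta_bar_i - eta_tilde_i|^2 ]
  v <- apply(eta_b, 2, var)                   # Var(eta_bar_i) under the posterior
  sum(sq_dev / v)
}
```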

Accuracy

We wish to assess how accurate the approximation is. Thus, we compare $\widetilde{p}(\boldsymbol{u} | \boldsymbol{y}, \boldsymbol{\theta})$ and $\overline{p}(\boldsymbol{u} |\boldsymbol{y},\boldsymbol{\theta})$. With Bayes' theorem,
$$
\begin{aligned}
p(\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta}) &= \frac{p(\boldsymbol{u},\boldsymbol{y}|\boldsymbol{\theta})}{p(\boldsymbol{y}|\boldsymbol{\theta})} \\
&= \frac{p(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta})\, p(\boldsymbol{u}|\boldsymbol{\theta})}{p(\boldsymbol{y}|\boldsymbol{\theta})},
\end{aligned}
$$
where $p(\boldsymbol{u}|\boldsymbol{\theta})$ is a Gaussian density and $p(\boldsymbol{y}|\boldsymbol{\theta})$ is a scaling factor that doesn't depend on $\boldsymbol{u}$. We can therefore focus on the behaviour of $\ln p(\boldsymbol{y}|\boldsymbol{\theta},\boldsymbol{u})$ for the exact and linearised observation models.

Recall that the observation likelihood only depends on $\boldsymbol{u}$ through $\boldsymbol{\eta}$. Using a Taylor expansion with respect to $\boldsymbol{\eta}$, at $\boldsymbol{\eta}^*=\widetilde{\boldsymbol{\eta}}(\boldsymbol{u}_*)$,
$$
\begin{aligned}
\ln p(\boldsymbol{y}|\boldsymbol{\eta},\boldsymbol{\theta}) &= \ln p(\boldsymbol{y}|\boldsymbol{\theta},\boldsymbol{\eta}^*) \\
&\qquad + \sum_i \left.\frac{\partial}{\partial\eta_i} \ln p(\boldsymbol{y} | \boldsymbol{\theta}, \boldsymbol{\eta}) \right|_{\boldsymbol{\eta}^*}\cdot (\eta_i - \eta^*_i) \\
&\qquad + \frac{1}{2}\sum_{i,j} \left.\frac{\partial^2}{\partial\eta_i\partial\eta_j} \ln p(\boldsymbol{y} | \boldsymbol{\theta}, \boldsymbol{\eta}) \right|_{\boldsymbol{\eta}^*}\cdot (\eta_i - \eta^*_i)(\eta_j - \eta^*_j) + \mathcal{O}(\|\boldsymbol{\eta}-\boldsymbol{\eta}^*\|^3).
\end{aligned}
$$
Similarly, for each component of $\widetilde{\boldsymbol{\eta}}$,
$$
\begin{aligned}
\widetilde{\eta}_i(\boldsymbol{u}) &= \eta^*_i + \left[\left.\nabla_{u}\widetilde{\eta}_i(\boldsymbol{u})\right|_{\boldsymbol{u}_*}\right]^\top (\boldsymbol{u} - \boldsymbol{u}_*) \\
&\quad + \frac{1}{2}(\boldsymbol{u} - \boldsymbol{u}_*)^\top\left[\left.\nabla_{u}\nabla_{u}^\top\widetilde{\eta}_i(\boldsymbol{u})\right|_{\boldsymbol{u}_*}\right](\boldsymbol{u} - \boldsymbol{u}_*) + \mathcal{O}(\|\boldsymbol{u}-\boldsymbol{u}_*\|^3) \\
&= \eta_i^* + b_i(\boldsymbol{u}) + h_i(\boldsymbol{u}) + \mathcal{O}(\|\boldsymbol{u}-\boldsymbol{u}_*\|^3) \\
&= \overline{\eta}_i(\boldsymbol{u}) + h_i(\boldsymbol{u}) + \mathcal{O}(\|\boldsymbol{u}-\boldsymbol{u}_*\|^3),
\end{aligned}
$$

where $\nabla_u\nabla_u^\top$ denotes the Hessian with respect to $\boldsymbol{u}$, $b_i$ are linear in $\boldsymbol{u}$, and $h_i$ are quadratic in $\boldsymbol{u}$. Combining the two expansions and taking the difference between the full and linearised log-likelihoods, we get
$$
\ln \widetilde{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta}) - \ln \overline{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta}) = \sum_i \left.\frac{\partial}{\partial\eta_i} \ln p(\boldsymbol{y} | \boldsymbol{\theta}, \boldsymbol{\eta}) \right|_{\boldsymbol{\eta}^*}\cdot h_i(\boldsymbol{u}) + \mathcal{O}(\|\boldsymbol{u}-\boldsymbol{u}_*\|^3).
$$
Note that the log-likelihood Hessian difference contribution only involves third order $\boldsymbol{u}$ terms and higher, so the expression above includes all terms up to second order.

Let
$$
g_i^* = \left.\frac{\partial}{\partial\eta_i} \ln p(\boldsymbol{y} | \boldsymbol{\theta}, \boldsymbol{\eta}) \right|_{\boldsymbol{\eta}^*}
\quad\text{and}\quad
\boldsymbol{H}^*_i = \left.\nabla_{u}\nabla_{u}^\top\widetilde{\eta}_i(\boldsymbol{u})\right|_{\boldsymbol{u}_*},
$$
and form the sum of their products, $\boldsymbol{G}=\sum_i g_i^*\boldsymbol{H}_i^*$. Then
$$
\begin{aligned}
\ln \widetilde{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta}) - \ln \overline{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta})
&= \frac{1}{2} \sum_i g_i^* (\boldsymbol{u}-\boldsymbol{u}_*)^\top \boldsymbol{H}_i^* (\boldsymbol{u}-\boldsymbol{u}_*) + \mathcal{O}(\|\boldsymbol{u}-\boldsymbol{u}_*\|^3) \\
&= \frac{1}{2} (\boldsymbol{u}-\boldsymbol{u}_*)^\top \boldsymbol{G} (\boldsymbol{u}-\boldsymbol{u}_*) + \mathcal{O}(\|\boldsymbol{u}-\boldsymbol{u}_*\|^3).
\end{aligned}
$$
With $\boldsymbol{m}=\mathsf{E}_{\overline{p}}(\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta})$ and $\boldsymbol{Q}^{-1}=\mathsf{Cov}_{\overline{p}}(\boldsymbol{u},\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta})$, we obtain
$$
\begin{aligned}
\mathsf{E}_{\overline{p}}\left[ \nabla_{\boldsymbol{u}} \left\{\ln \widetilde{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta}) - \ln \overline{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta})\right\}\right] &\approx \boldsymbol{G}(\boldsymbol{m}-\boldsymbol{u}_*), \\
\mathsf{E}_{\overline{p}}\left[ \nabla_{\boldsymbol{u}}\nabla_{\boldsymbol{u}}^\top \left\{\ln \widetilde{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta}) - \ln \overline{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta})\right\}\right] &\approx \boldsymbol{G}, \\
\mathsf{E}_{\overline{p}}\left[ \ln \widetilde{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta}) - \ln \overline{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta})\right] &\approx \frac{1}{2} \mathop{\mathrm{tr}}(\boldsymbol{G}\boldsymbol{Q}^{-1}) + \frac{1}{2} (\boldsymbol{m}-\boldsymbol{u}_*)^\top\boldsymbol{G}(\boldsymbol{m}-\boldsymbol{u}_*).
\end{aligned}
$$
For each $\boldsymbol{\theta}$ configuration in the INLA output, we can extract both $\boldsymbol{m}$ and the sparse precision matrix $\boldsymbol{Q}$ for the Gaussian approximation. The non-sparsity structure of $\boldsymbol{G}$ is contained in the non-sparsity of $\boldsymbol{Q}$, which allows the use of Takahashi recursion (inla.qinv(Q)) to compute the corresponding $\boldsymbol{Q}^{-1}$ values needed to evaluate the trace $\mathop{\mathrm{tr}}(\boldsymbol{G}\boldsymbol{Q}^{-1})$. Thus, implementing a numerical approximation of this error analysis only requires special access to the log-likelihood derivatives $g_i^*$, since $\boldsymbol{H}_i^*$ can in principle be evaluated numerically.
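When the gradients $g_i^*$ and Hessians $\boldsymbol{H}_i^*$ are available (for instance via numerical differentiation), the expected log-likelihood discrepancy can be approximated as follows. This is a sketch under the assumptions that Q is the sparse precision matrix of the Gaussian approximation and that the non-zero pattern of $\boldsymbol{G}$ is contained in that of Q; g_star and H_list are hypothetical inputs, not inlabru objects.

```r
# Sketch: expected discrepancy 0.5 * tr(G Q^{-1}) + 0.5 * (m - u*)' G (m - u*).
# Q: sparse precision of the INLA Gaussian approximation, m: its mean,
# u_star: linearisation point, g_star: log-likelihood gradients g_i^*,
# H_list: list of predictor Hessians H_i^* (all hypothetical inputs).
library(Matrix)
library(INLA)

expected_loglik_discrepancy <- function(Q, m, u_star, g_star, H_list) {
  G <- Reduce(`+`, Map(function(g, H) g * H, g_star, H_list))  # G = sum_i g_i^* H_i^*
  S <- INLA::inla.qinv(Q)              # partial inverse of Q on its sparsity pattern
  trace_term <- sum(G * S)             # tr(G Q^{-1}) when G's pattern lies within Q's
  d <- as.numeric(m - u_star)
  0.5 * trace_term + 0.5 * sum(d * as.numeric(G %*% d))
}
```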

For a given $\boldsymbol{\theta}$,
$$
\begin{aligned}
\mathsf{KL}\left(\overline{p}\,\middle\|\,\widetilde{p}\right)
&= E_{\overline{p}}\left[\ln\frac{\overline{p}(\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta})}{\widetilde{p}(\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta})}\right] \\
&= E_{\overline{p}}\left[ \ln\frac{\overline{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta})}{\widetilde{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta})} \right] - \ln\frac{\overline{p}(\boldsymbol{y}|\boldsymbol{\theta})}{\widetilde{p}(\boldsymbol{y}|\boldsymbol{\theta})}.
\end{aligned}
$$
The first term was approximated above. The second term can also be approximated using the derived quantities (to be continued…).

Summary: The form of the observation likelihood discrepancy shows that, given a linearised posterior $\mathsf{N}(\boldsymbol{m},\boldsymbol{Q}^{-1})$, a Gaussian approximation to the non-linear model posterior, $\mathsf{N}(\widetilde{\boldsymbol{m}},\widetilde{\boldsymbol{Q}}^{-1})$, can be obtained from $\widetilde{\boldsymbol{Q}}=\boldsymbol{Q}-\boldsymbol{G}$ and $\widetilde{\boldsymbol{Q}}\widetilde{\boldsymbol{m}}=\boldsymbol{Q}\boldsymbol{m}-\boldsymbol{G}\boldsymbol{u}_*$. The Kullback–Leibler divergence becomes
$$
\mathsf{KL}\left(\overline{p}\,\middle\|\,\widetilde{p}\right) \approx \frac{1}{2} \left[ \ln\det(\boldsymbol{Q}) - \ln\det(\boldsymbol{Q}-\boldsymbol{G}) - \mathop{\mathrm{tr}}\left(\boldsymbol{G}\boldsymbol{Q}^{-1}\right) + (\boldsymbol{m}-\boldsymbol{u}_*)^\top\boldsymbol{G}(\boldsymbol{Q}-\boldsymbol{G})^{-1}\boldsymbol{G}(\boldsymbol{m}-\boldsymbol{u}_*) \right].
$$
When the INLA posterior has mean $\boldsymbol{m}=\boldsymbol{u}_*$, e.g. for models with additive Gaussian observation noise, and $\boldsymbol{\theta}=\widehat{\boldsymbol{\theta}}_{\boldsymbol{u}_*}$, the last term vanishes.
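A numerical version of this summary is straightforward once Q, m, u_* and G are available. The sketch below, continuing the hypothetical inputs from the previous sketch, computes the corrected Gaussian approximation and the approximate Kullback–Leibler divergence, using sparse log-determinants and linear solves rather than explicit inverses.

```r
# Sketch: corrected Gaussian approximation and approximate KL divergence,
# given Q, m, u_star, and G from the discrepancy analysis (hypothetical inputs).
library(Matrix)
library(INLA)

kl_and_correction <- function(Q, m, u_star, G) {
  Q_tilde <- Q - G                                    # corrected precision: Q - G
  m_tilde <- solve(Q_tilde, Q %*% m - G %*% u_star)   # corrected mean: Q~ m~ = Q m - G u*
  S <- INLA::inla.qinv(Q)                             # partial inverse of Q on its pattern
  d <- as.numeric(m - u_star)
  Gd <- as.numeric(G %*% d)
  kl <- 0.5 * (
    as.numeric(determinant(Q, logarithm = TRUE)$modulus) -        # ln det(Q)
    as.numeric(determinant(Q_tilde, logarithm = TRUE)$modulus) -  # ln det(Q - G)
    sum(G * S) +                                                  # tr(G Q^{-1})
    sum(Gd * as.numeric(solve(Q_tilde, Gd)))  # (m - u*)' G (Q - G)^{-1} G (m - u*)
  )
  list(m_tilde = as.numeric(m_tilde), Q_tilde = Q_tilde, kl = kl)
}
```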

Note: by implementing the Kullback–Leibler divergence accuracy metric, a by-product would be improved posterior estimates based on $\widetilde{\boldsymbol{m}}$ and $\widetilde{\boldsymbol{Q}}$.

Well-posedness and initialisation

On a side note, one might be concerned about initialisation at, or convergence to, a saddle point. Although no special safeguards for this are implemented in inlabru, it is worth discussing the technicality of how the initial linearisation point $\boldsymbol{u}_0$ is defined.

Generally speaking, any value of $\boldsymbol{u}_0$ works, except when the gradient evaluated at $\boldsymbol{u}_0$ is $\boldsymbol{0}$: if the prior mean is also $\boldsymbol{0}$, the linearisation point will never move away from it. This tends to indicate a saddle point problem. In some cases the problem can be handled by changing the predictor parameterisation, or simply by changing the initialisation point using the bru_initial option. However, for true saddle point problems, it indicates that the predictor parameterisation may lead to a multimodal posterior distribution, or is ill-posed in some other way. This is a more fundamental problem that cannot be fixed by changing the initialisation point.

In the following examples, where $\beta$ and $\boldsymbol{u}$ are latent Gaussian components, predictors 1, 3, and 4 would typically be safe, but predictor 2 is fundamentally non-identifiable:
$$
\begin{aligned}
\boldsymbol{\eta}_1 &= \boldsymbol{u}, \\
\boldsymbol{\eta}_2 &= \beta \boldsymbol{u}, \\
\boldsymbol{\eta}_3 &= e^\beta \boldsymbol{u}, \\
\boldsymbol{\eta}_4 &= F_\beta^{-1}(\Phi(z_\beta))\, \boldsymbol{u}, \quad z_{\beta} \sim \mathsf{N}(0,1).
\end{aligned}
$$
Note that for $\boldsymbol{\eta}_3$ and $\boldsymbol{\eta}_4$, the partial derivatives with respect to $\beta$ are zero for $\boldsymbol{u}=\boldsymbol{0}$. However, the first inlabru iteration will give a non-zero estimate of $\boldsymbol{u}$, so that subsequent iterations will involve both $\beta$ and $\boldsymbol{u}$.
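If a zero-gradient initialisation is a concern, the initial linearisation point can be changed through the bru_initial option mentioned earlier. The call below is a hedged sketch only: the component names (beta, u), the data frame df, and the exact list structure accepted by bru_initial are illustrative assumptions and should be checked against ?bru_options; the predictor mimics $\boldsymbol{\eta}_3 = e^\beta \boldsymbol{u}$ above.

```r
# Sketch: supplying a non-zero initial linearisation point via the bru_initial option.
# The component names, the data frame df (with columns y and idx), and the exact
# structure expected by bru_initial are illustrative assumptions.
library(inlabru)

cmp <- ~ beta(1, model = "linear") + u(idx, model = "iid")

fit <- bru(
  components = cmp,
  formula = y ~ exp(beta) * u,   # non-linear predictor of the eta_3 form
  family = "gaussian",
  data = df,
  options = list(bru_initial = list(beta = 1, u = rep(1, nrow(df))))
)
```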