Skip to contents

Why aggregate data?

Meta-analysis in pharmacometrics means combining evidence across studies to obtain parameter estimates that no single study could support on its own. The classical approach pools individual patient data (IPD) from multiple studies into one dataset. This is scientifically ideal but practically rare: data sharing agreements, proprietary restrictions, and the simple fact that most published work reports only summary statistics make IPD pooling the exception rather than the rule.

Model-based meta-analysis (MBMA) addresses this by fitting pharmacometric models directly to aggregate summaries. These summaries come from two distinct sources. The first is directly published: many clinical studies report a mean concentration-time profile together with the variance at each time point, and sometimes the full covariance matrix. The second is model-derived: a published pharmacometric model with its parameter estimates (fixed effects, random-effects variance, residual error) can be used to generate the expected mean vector and covariance matrix via simulation, without access to the original data. The covariance structure carries information about the shape of individual profiles that a variance-only summary would discard — two studies can have identical variances at each time point yet very different correlation structures, reflecting different sources of variability. Both types of aggregate data, whether published directly or derived from a published model, can be analysed within the same framework. This makes the full literature — competitor compounds, historical datasets, regulatory submissions — available as modelling input for dose selection, trial design, and disease-progression modelling.

The aggregate data likelihood, first systematically described by Välitalo (2021), provides the statistical foundation for MBMA: given a nonlinear mixed-effects model, what is the probability of observing the reported mean and covariance? The first R implementation was admr, described in van de Beek et al. (2025). admixr2 re-implements the same approach within the nlmixr2/rxode2 ecosystem, extending it with new algorithmic ideas that are the focus of this post.


The aggregate likelihood

Observed data

For a single study with nn subjects, each measured at observation times t1,,tTt_1, \ldots, t_T, the observed data are two sufficient statistics:

yT,ST×T \bar{y} \in \mathbb{R}^T, \qquad S \in \mathbb{R}^{T \times T}

where y\bar{y} is the observed mean vector and SS is the observed covariance matrix (ML denominator nn, not n1n - 1).

The model

A nonlinear mixed-effects model specifies that subject ii’s observations arise from

yi=f(θ,ηi)+εi,ηi𝒩(0,Ω),εi𝒩(0,Σ) y_i = f(\theta, \eta_i) + \varepsilon_i, \qquad \eta_i \sim \mathcal{N}(0, \Omega), \quad \varepsilon_i \sim \mathcal{N}(0, \Sigma)

where f(θ,)f(\theta, \cdot) is an ODE system (or closed-form prediction function) with structural parameters θ\theta and individual random effects ηi\eta_i. Ωd×d\Omega \in \mathbb{R}^{d \times d} is the between-subject covariance, ΣT×T\Sigma \in \mathbb{R}^{T \times T} is the residual error covariance (typically a diagonal function of the structural prediction, e.g. proportional error).

Population moments

Under this model the population-level distribution of yiy_i has mean and covariance

μpred=𝔼η[f(θ,η)],Vpred=Varη[f(θ,η)]+Σ. \mu_\text{pred} = \mathbb{E}_\eta\bigl[f(\theta, \eta)\bigr], \qquad V_\text{pred} = \operatorname{Var}_\eta\bigl[f(\theta, \eta)\bigr] + \Sigma.

The aggregate log-likelihood

If the nn subjects are independent and drawn from the same population, the sample mean y\bar{y} follows approximately 𝒩(μpred,Vpred/n)\mathcal{N}(\mu_\text{pred},\, V_\text{pred}/n), and the sample covariance SS concentrates around VpredV_\text{pred}. Under a multivariate normal approximation to the joint density of (y,S)(\bar{y}, S), the log-likelihood reduces to

2=n[log|Vpred|+tr(Vpred1S)+rVpred1r],r=yμpred. -2\ell = n \Bigl[ \log |V_\text{pred}| + \operatorname{tr}(V_\text{pred}^{-1} S) + r^\top V_\text{pred}^{-1} r \Bigr], \qquad r = \bar{y} - \mu_\text{pred}.

This is the central formula in admixr2. Each estimator minimises the same objective; they differ only in how μpred\mu_\text{pred} and VpredV_\text{pred} are computed.

The trace and quadratic terms are evaluated via Cholesky factorisation of VpredV_\text{pred}, which avoids explicit matrix inversion and is numerically stable even for near-singular covariances. The log-determinant falls out as twice the sum of log-diagonal elements of the Cholesky factor.


The integration problem

For a nonlinear ODE model, μpred\mu_\text{pred} and VpredV_\text{pred} are integrals over η𝒩(0,Ω)\eta \sim \mathcal{N}(0, \Omega). There is no closed form. The four estimators differ in how they handle this intractability: adfo linearises the integrand, admc samples it randomly, adgh integrates it on a deterministic quadrature grid, and adirmc reuses a fixed sample via importance reweighting.


First-Order linearisation (adfo)

The simplest approach approximates f(θ,η)f(\theta, \eta) by its first-order Taylor expansion around η=0\eta = 0:

f(θ,η)f(θ,0)+Jη,Jt,j=ftηj|η=0. f(\theta, \eta) \approx f(\theta, 0) + J\,\eta, \qquad J_{t,j} = \left.\frac{\partial f_t}{\partial \eta_j}\right|_{\eta=0}.

Substituting into the population moment definitions gives closed-form expressions:

μpred=f(θ,0),Vpred=JΩJ+Σ. \mu_\text{pred} = f(\theta, 0), \qquad V_\text{pred} = J\,\Omega\,J^\top + \Sigma.

The Jacobian JJ is the sensitivity matrix of the ODE output with respect to the random effects, evaluated at η=0\eta = 0. In admixr2 it is obtained by augmenting the ODE system with sensitivity equations

ddtxηj=gxxηj+gηj, \frac{d}{dt}\frac{\partial x}{\partial \eta_j} = \frac{\partial g}{\partial x}\,\frac{\partial x}{\partial \eta_j} + \frac{\partial g}{\partial \eta_j},

where gg is the ODE right-hand side. A single solver call delivers both ff and all columns of JJ simultaneously — no additional ODE solves per random-effect dimension. In the predecessor admr, this Jacobian was computed by finite differences on ff, requiring one extra ODE solve per random-effect dimension.

Gradient of the FO NLL. Because Vpred=JΩJ+ΣV_\text{pred} = J\Omega J^\top + \Sigma is explicit in Ω\Omega and Σ\Sigma, the gradient with respect to these parameters is available analytically via the matrix derivative identities dlog|A|=tr(A1dA)d \log|A| = \operatorname{tr}(A^{-1} dA) and dtr(A1B)=tr(A1dAA1B)d \operatorname{tr}(A^{-1}B) = -\operatorname{tr}(A^{-1} dA\, A^{-1} B). Structural parameter gradients use forward finite differences through the full FO NLL, reusing the sensitivity solve.

Accuracy. The FO approximation is exact when ff is linear in η\eta. For nonlinear models with large IIV the approximation underestimates Varη[f(θ,η)]\operatorname{Var}_\eta[f(\theta, \eta)]: VpredV_\text{pred} is too small and the estimated Ω\Omega is negatively biased. Practically, bias is negligible when the coefficient of variation for each PK parameter is below 20–30 % and the model is weakly nonlinear in η\eta.


Monte Carlo simulation (admc)

The MC estimator makes no approximation to ff. It draws NN random-effect samples and computes sample moments:

μ̂=1Ni=1Nf(θ,ηi),V̂=1Ni=1N(f(θ,ηi)μ̂)(f(θ,ηi)μ̂)+Σ,ηi𝒩(0,Ω). \hat{\mu} = \frac{1}{N}\sum_{i=1}^N f(\theta, \eta_i), \qquad \hat{V} = \frac{1}{N}\sum_{i=1}^N \bigl(f(\theta,\eta_i) - \hat{\mu}\bigr) \bigl(f(\theta,\eta_i) - \hat{\mu}\bigr)^\top + \Sigma, \qquad \eta_i \sim \mathcal{N}(0, \Omega).

The estimator converges to the true population moments as NN \to \infty; its AIC is directly interpretable.

Quasi-random sampling. admixr2 draws samples using Sobol sequences (also available in admr) rather than pseudo-random normal deviates. Sobol sequences are low-discrepancy: successive points are placed to fill gaps in the sample space, rather than clustering randomly. For the smooth integrands that arise in pharmacometric MC integration, this uniform coverage reduces the integration error for a given NN compared to i.i.d. draws, at the cost of no additional model evaluations.

Numerical stability. The sample covariance is computed in fused C++ kernels that accumulate centred products in a single pass, avoiding an intermediate N×TN \times T allocation. The resulting V̂\hat{V} is guaranteed positive semi-definite by construction.


Gauss-Hermite quadrature (adgh)

The MC estimator replaces the population integral with a random average. The GH estimator replaces it with a deterministic one: a tensor-product Gauss-Hermite quadrature rule that evaluates the same expectation over η𝒩(0,Ω)\eta \sim \mathcal{N}(0, \Omega) at a small set of fixed nodes. This estimator is new in admixr2 — it has no counterpart in admr.

For a single random effect, the probabilists’ Gauss-Hermite rule approximates an expectation under the standard normal by a weighted sum over mm nodes,

𝔼z𝒩(0,1)[g(z)]q=1mwqg(xq), \mathbb{E}_{z\sim\mathcal{N}(0,1)}[g(z)] \approx \sum_{q=1}^m w_q\, g(x_q),

with nodes xqx_q and weights wqw_q (normalised so qwq=1\sum_q w_q = 1 and qwqxq2=1\sum_q w_q x_q^2 = 1) obtained from the Golub-Welsch algorithm — the eigendecomposition of the symmetric tridiagonal Jacobi matrix of the Hermite recurrence, requiring no external tables. For nηn_\eta random effects the rule is applied as a tensor product, giving a grid of Q=mnηQ = m^{n_\eta} nodes xqnηx_q \in \mathbb{R}^{n_\eta} with product weights. The Cholesky factor LL (Ω=LL\Omega = LL^\top) maps the standard-normal nodes into the correlated random-effect space, ηq=Lxq\eta_q = L x_q, so the population moments become

μpred=q=1Qwqf(θ,Lxq),Vpred=q=1Qwq(f(θ,Lxq)μpred)()+Σ. \mu_\text{pred} = \sum_{q=1}^Q w_q\, f(\theta, L x_q), \qquad V_\text{pred} = \sum_{q=1}^Q w_q\, \bigl(f(\theta, L x_q) - \mu_\text{pred}\bigr) \bigl(\cdots\bigr)^\top + \Sigma.

Structurally this is the MC estimator with a fixed deterministic node grid in place of NN random draws and quadrature weights in place of uniform ones, so it plugs into exactly the same aggregate MVN 2-2\ell.

Noise-free objective. Because the nodes and weights are fixed, the objective is a smooth deterministic function of (θ,Ω,Σ)(\theta, \Omega, \Sigma) — there is no Monte Carlo noise to average away and no n_sim to tune. The analytical gradient (closed-form contractions through the same ODE sensitivity outputs used by the MC estimator) is therefore exact, and the numerical Hessian used for standard errors is well-conditioned. Like admc it makes no approximation to ff, so it is unbiased at any IIV magnitude, and its likelihood is on the same scale — its AIC is directly comparable to admc and adirmc.

Cost. The grid size Q=mnηQ = m^{n_\eta} grows exponentially in the number of random effects. For models with up to ~4 etas, QQ is small (e.g. 54=6255^4 = 625 nodes) and adgh is substantially faster than MC at equivalent accuracy; for high-dimensional IIV the node count becomes prohibitive and admc or adirmc are preferable. A modest node count (n_nodes = 5) gives near-exact covariance moments for IIV SD up to ~0.5; n_nodes = 7 extends this to ~0.7.


Iterative Reweighting MC (adirmc)

The central bottleneck of the MC estimator is that every NLL evaluation requires NN ODE solves — one per sample. For a complex model each solve is expensive, and an L-BFGS optimisation may require hundreds of NLL evaluations.

IRMC decouples sample generation from optimisation.

Proposal distribution

At each outer phase, NN proposals are drawn once from an inflated prior:

η̃i𝒩(0,αΩ0),α1, \tilde{\eta}_i \sim \mathcal{N}\!\left(0,\, \alpha\,\Omega_0\right), \qquad \alpha \geq 1,

where Ω0\Omega_0 is the current Omega estimate and α\alpha is the expansion factor. The predictions f(θ0,η̃i)f(\theta_0, \tilde{\eta}_i) are computed once and stored. For the remainder of the inner optimisation they are reused without further ODE solves.

Importance weights

When the parameter vector moves to a candidate (θ,Ω)(\theta, \Omega), the stored predictions are no longer exact. For the random-effects covariance Ω\Omega, the proposals are reweighted by the ratio of target density to proposal density:

wip(η̃iΩ)q(η̃iαΩ0)=exp[12η̃i(Ω1(αΩ0)1)η̃i]. w_i \propto \frac{p(\tilde{\eta}_i \mid \Omega)}{q(\tilde{\eta}_i \mid \alpha\,\Omega_0)} = \exp\!\left[ -\tfrac{1}{2}\tilde{\eta}_i^\top (\Omega^{-1} - (\alpha\,\Omega_0)^{-1}) \tilde{\eta}_i \right].

Weights are normalised via softmax. The weighted mean and covariance of the predictions then provide the IRMC estimates of μpred\mu_\text{pred} and VpredV_\text{pred}:

μpred=iwif(θ0,η̃i),Vpred=iwi(f(θ0,η̃i)μpred)()+Σ. \mu_\text{pred} = \sum_i w_i\, f(\theta_0, \tilde\eta_i), \qquad V_\text{pred} = \sum_i w_i\, (f(\theta_0, \tilde\eta_i) - \mu_\text{pred}) (\cdots)^\top + \Sigma.

Given fixed proposals this inner objective is a smooth, deterministic function of (θ,Ω)(\theta, \Omega) and can be optimised to high precision. Between phases, proposals are refreshed and box constraints are progressively tightened to guide global convergence. The number of ODE solves scales with the number of phases, not the number of inner optimizer steps.


Gradient computation: from finite differences to sensitivity equations

The gradient is the single most important ingredient for efficient nonlinear optimisation. The improvements in gradient computation represent the most consequential algorithmic differences between admr and admixr2.

Gradient in admr

admr stores a fixed Sobol base matrix biseq for the lifetime of an optimisation run. Random-effect samples are formed as ηi=𝚋𝚒𝚜𝚎𝚚L(Ω)\eta_i = \texttt{biseq} \cdot L(\Omega), so the NLL is a smooth deterministic function of the parameter vector for fixed biseq. The gradient for the MC estimator is computed by forward finite differences of this deterministic NLL:

θk(θ+hek)(θ)h,h=106. \frac{\partial \ell}{\partial \theta_k} \approx \frac{\ell(\theta + h\,e_k) - \ell(\theta)}{h}, \qquad h = 10^{-6}.

For a model with pp parameters, each gradient call therefore costs pp extra NLL evaluations, i.e. p×Np \times N additional ODE solves. Because this is expensive, admr defaults to gradient-free BOBYQA; L-BFGS with FD gradient is available but off by default (use_grad = FALSE).

For the IRMC estimator, proposals are fixed inside a closure before the inner optimisation begins, so each inner NLL evaluation requires only matrix operations — no ODE solves. admr’s optional FD inner gradient therefore incurs pp extra importance-weighted recomputations per step, but still no extra ODE solves. Nonetheless, the inner optimizer defaults to BOBYQA (no gradient) and the FD approximation introduces truncation error of order O(h)O(h).

CRN analytical gradient in admixr2

admixr2 fixes the sample set {ηi}\{\eta_i\} and computes the gradient of the frozen deterministic MC NLL analytically — an application of the Common Random Numbers (CRN) identity.

For a mu-referenced parameterisation ψk=exp(θk+ηk)\psi_k = \exp(\theta_k + \eta_k), the chain rule gives

μ̂θk=1Ni=1Nf(θ,ηi)ηkψkθkψkηkψk1/ψk=1Ni=1Nf(θ,ηi)ηk, \frac{\partial \hat{\mu}}{\partial \theta_k} = \frac{1}{N}\sum_{i=1}^N \frac{\partial f(\theta, \eta_i)}{\partial \eta_k} \cdot \underbrace{\frac{\partial \psi_k}{\partial \theta_k}}_{\psi_k} \cdot \underbrace{\frac{\partial \eta_k}{\partial \psi_k}}_{1/\psi_k} = \frac{1}{N}\sum_{i=1}^N \frac{\partial f(\theta, \eta_i)}{\partial \eta_k},

i.e. the sample average of the ODE sensitivity output f/ηk\partial f / \partial \eta_k — which is already available from the NLL evaluation via the augmented ODE system. No additional ODE solves are needed.

The gradient of V̂\hat{V} with respect to Ω\Omega follows similarly: the Cholesky factor LL (Ω=LL\Omega = LL^\top) enters through the sample covariance, and differentiating through the MVN chain rule yields closed-form expressions in the stored sensitivity outputs.

The net result is that the full MC gradient costs the same as a single NLL evaluation — it requires no additional ODE solves and is exact rather than a finite-difference approximation. For a model with p=10p = 10 parameters and N=5000N = 5000 samples, admr’s FD gradient requires 5000050\,000 additional ODE solves per optimizer step; admixr2’s CRN gradient requires none.

Analytical IRMC inner gradient

The importance-weighted inner objective has a tractable analytical gradient with respect to both Ω\Omega and θ\theta. The gradient passes through the softmax normalisation via the identity

piwi(p)gi=iwi(gip+logwip(gig)), \frac{\partial}{\partial p} \sum_i w_i(\,p\,)\, g_i = \sum_i w_i \left(\frac{\partial g_i}{\partial p} + \frac{\partial \log w_i}{\partial p}\,(g_i - \bar{g})\right),

where g=iwigi\bar{g} = \sum_i w_i g_i. Differentiating the multivariate normal log-likelihood of the weighted mean and covariance through this identity yields closed-form expressions in the pre-computed ODE solutions and sensitivity outputs. Since proposals are fixed, the inner gradient evaluation requires no ODE solves regardless of the implementation — the admixr2 advantage over admr’s FD approach is exactness and a p×p\times reduction in inner-level matrix operations.

The inner optimizer therefore runs L-BFGS with an exact gradient and converges in far fewer iterations than BOBYQA.

Kappa correction for non-mu-referenced parameters

Both admr and admixr2 handle models where some structural parameters βk\beta_k enter ff without a paired random effect (e.g. CL=exp(β1)exp(η1)\mathrm{CL} = \exp(\beta_1) \cdot \exp(\eta_1), but VmaxV_\text{max} has no η\eta). When such parameters change during the inner IRMC optimisation, the stored predictions f(θ0,η̃i)f(\theta_0, \tilde\eta_i) are no longer exact even for the mean. Both packages apply a kappa correction:

κ(β)=f(βnew,0)f(βorig,0), \kappa(\beta) = f(\beta_\text{new}, 0) - f(\beta_\text{orig}, 0),

shifting μpred\mu_\text{pred} by the predicted change at η=0\eta = 0.

admixr2 introduces a linearized kappa option (kappa_method = "linearized") that computes the Jacobian Jκ=f/βsingleJ_\kappa = \partial f / \partial \beta_\text{single} once per outer iteration (via a single batched FD solve) and approximates the correction as Jκ(βnewβorig)J_\kappa (\beta_\text{new} - \beta_\text{orig}) during the inner loop — a pure matrix operation requiring no ODE solver calls. This is particularly valuable for complex ODE systems where each kappa evaluation is otherwise expensive: with the exact kappa, every inner NLL step requires one extra rxSolve for the single-beta parameters; with linearized kappa, the inner loop is completely free of ODE solver calls.


Parameter space geometry

Cholesky parameterisation of Ω\Omega

The between-subject covariance Ω\Omega must be positive definite. Unconstrained optimisation in the space of symmetric matrices can produce non-positive-definite steps.

Both admr and admixr2 parameterise Ω\Omega through a Cholesky factor LL (Ω=LL\Omega = LL^\top). The packages differ in how the Cholesky entries are encoded. admr uses a log transform for diagonal entries and a covlogit transform for off-diagonal entries (the correlation corresponding to each LjkL_{jk} is mapped through a logit after normalisation). admixr2 uses:

pk=log(Ωkk)=2log(Lkk)(diagonal),pjk=Ljk(off-diagonal, raw). p_k = \log(\Omega_{kk}) = 2\log(L_{kk}) \quad\text{(diagonal)}, \qquad p_{jk} = L_{jk} \quad\text{(off-diagonal, raw)}.

The raw parameterisation simplifies the gradient computation substantially: the Jacobian Ω/pjk\partial \Omega / \partial p_{jk} is a simple rank-1 matrix with no logit derivative to chain through. It also avoids the numerical instability of logit-based transforms when correlations approach ±1.

The specific encoding p=log(Ωkk)p = \log(\Omega_{kk}) rather than p=log(Lkk)p = \log(L_{kk}) equalises gradient sensitivity across parameter types: a unit optimizer step changes Ωkk\Omega_{kk} by a factor of e2e^2, matching the sensitivity of structural parameters on the log scale.

Parameter preconditioning

The unconstrained parameter vector is pre-scaled by a diagonal matrix CC before being passed to L-BFGS. Each diagonal entry estimates the curvature in that direction: unity for exponential transforms, a derivative-based magnitude for logit/probit transforms, and max(|Ljk|,0.1)\max(|L_{jk}|, 0.1) for off-diagonal Cholesky entries. Preconditioning reduces condition number and accelerates convergence when structural parameters and variance components differ by orders of magnitude.


Summary of improvements over admr

admr (van de Beek et al., 2025) established the core aggregate data workflow in R. admixr2 preserves the same three estimators (FO, MC, IRMC) and the same aggregate MVN likelihood — adding a fourth, deterministic Gauss-Hermite estimator — while replacing the surrounding architecture and several algorithmic components.

Aspect admr admixr2
Ecosystem Standalone R package nlmixr2/rxode2 integration
Model syntax Custom genopts() + prediction function nlmixr2 ini() / model() blocks
Fit object Plain list nlmixr2 fit object (AIC, logLik, plot, print)
Estimators FO, MC, IRMC FO, MC, GH, IRMC
Gauss-Hermite quadrature estimator Not available New deterministic, noise-free estimator
FO Jacobian (f/η\partial f/\partial\eta) C++ finite differences ODE sensitivity equations
MC gradient FD of frozen MC NLL (p×Np \times N extra solves) CRN analytical (0 extra ODE solves)
IRMC inner optimizer BOBYQA (default) or FD gradient L-BFGS with analytical gradient
Linearized kappa Not available Optional; eliminates ODE calls in inner loop
Ω\Omega off-diagonal encoding covlogit (correlation logit) Raw Cholesky entry LjkL_{jk}
Parameter preconditioning No Yes (diagonal scaling)
Parallel restarts Sequential chains furrr workers

The gradient improvements are the most consequential. Moving from a gradient that requires p×Np \times N extra ODE solves per step (admr FD) to one that requires none (admixr2 CRN) reduces wall-clock fitting time by one to two orders of magnitude for gradient-based algorithms on complex models. The analytical IRMC inner gradient compounds this saving: the inner optimisation converges in fewer iterations, each iteration is faster, and the linearized kappa option allows the entire inner loop to run without any ODE solver calls.


Further reading

The mathematical foundations are described in detail in the papers linked at the top of this post. The Estimator comparison vignette shows worked examples on the included examplomycin dataset with all four estimators. The Advanced usage vignette covers gradient modes, multi-restart fitting, and model comparison via AIC.