
The mathematics of aggregate data modelling
H. van de Beek
2026-05-29
Source:vignettes/articles/introducing-admixr2.Rmd
introducing-admixr2.RmdWhy aggregate data?
Meta-analysis in pharmacometrics means combining evidence across studies to obtain parameter estimates that no single study could support on its own. The classical approach pools individual patient data (IPD) from multiple studies into one dataset. This is scientifically ideal but practically rare: data sharing agreements, proprietary restrictions, and the simple fact that most published work reports only summary statistics make IPD pooling the exception rather than the rule.
Model-based meta-analysis (MBMA) addresses this by fitting pharmacometric models directly to aggregate summaries. These summaries come from two distinct sources. The first is directly published: many clinical studies report a mean concentration-time profile together with the variance at each time point, and sometimes the full covariance matrix. The second is model-derived: a published pharmacometric model with its parameter estimates (fixed effects, random-effects variance, residual error) can be used to generate the expected mean vector and covariance matrix via simulation, without access to the original data. The covariance structure carries information about the shape of individual profiles that a variance-only summary would discard — two studies can have identical variances at each time point yet very different correlation structures, reflecting different sources of variability. Both types of aggregate data, whether published directly or derived from a published model, can be analysed within the same framework. This makes the full literature — competitor compounds, historical datasets, regulatory submissions — available as modelling input for dose selection, trial design, and disease-progression modelling.
The aggregate data likelihood, first systematically described by Välitalo (2021), provides the statistical foundation for MBMA: given a nonlinear mixed-effects model, what is the probability of observing the reported mean and covariance? The first R implementation was admr, described in van de Beek et al. (2025). admixr2 re-implements the same approach within the nlmixr2/rxode2 ecosystem, extending it with new algorithmic ideas that are the focus of this post.
The aggregate likelihood
Observed data
For a single study with subjects, each measured at observation times , the observed data are two sufficient statistics:
where is the observed mean vector and is the observed covariance matrix (ML denominator , not ).
The model
A nonlinear mixed-effects model specifies that subject ’s observations arise from
where is an ODE system (or closed-form prediction function) with structural parameters and individual random effects . is the between-subject covariance, is the residual error covariance (typically a diagonal function of the structural prediction, e.g. proportional error).
The aggregate log-likelihood
If the subjects are independent and drawn from the same population, the sample mean follows approximately , and the sample covariance concentrates around . Under a multivariate normal approximation to the joint density of , the log-likelihood reduces to
This is the central formula in admixr2. Each estimator minimises the same objective; they differ only in how and are computed.
The trace and quadratic terms are evaluated via Cholesky factorisation of , which avoids explicit matrix inversion and is numerically stable even for near-singular covariances. The log-determinant falls out as twice the sum of log-diagonal elements of the Cholesky factor.
The integration problem
For a nonlinear ODE model,
and
are integrals over
.
There is no closed form. The four estimators differ in how they handle
this intractability: adfo linearises the integrand,
admc samples it randomly, adgh integrates it
on a deterministic quadrature grid, and adirmc reuses a
fixed sample via importance reweighting.
First-Order linearisation (adfo)
The simplest approach approximates by its first-order Taylor expansion around :
Substituting into the population moment definitions gives closed-form expressions:
The Jacobian is the sensitivity matrix of the ODE output with respect to the random effects, evaluated at . In admixr2 it is obtained by augmenting the ODE system with sensitivity equations
where is the ODE right-hand side. A single solver call delivers both and all columns of simultaneously — no additional ODE solves per random-effect dimension. In the predecessor admr, this Jacobian was computed by finite differences on , requiring one extra ODE solve per random-effect dimension.
Gradient of the FO NLL. Because is explicit in and , the gradient with respect to these parameters is available analytically via the matrix derivative identities and . Structural parameter gradients use forward finite differences through the full FO NLL, reusing the sensitivity solve.
Accuracy. The FO approximation is exact when is linear in . For nonlinear models with large IIV the approximation underestimates : is too small and the estimated is negatively biased. Practically, bias is negligible when the coefficient of variation for each PK parameter is below 20–30 % and the model is weakly nonlinear in .
Monte Carlo simulation (admc)
The MC estimator makes no approximation to . It draws random-effect samples and computes sample moments:
The estimator converges to the true population moments as ; its AIC is directly interpretable.
Quasi-random sampling. admixr2 draws samples using Sobol sequences (also available in admr) rather than pseudo-random normal deviates. Sobol sequences are low-discrepancy: successive points are placed to fill gaps in the sample space, rather than clustering randomly. For the smooth integrands that arise in pharmacometric MC integration, this uniform coverage reduces the integration error for a given compared to i.i.d. draws, at the cost of no additional model evaluations.
Numerical stability. The sample covariance is computed in fused C++ kernels that accumulate centred products in a single pass, avoiding an intermediate allocation. The resulting is guaranteed positive semi-definite by construction.
Gauss-Hermite quadrature (adgh)
The MC estimator replaces the population integral with a random average. The GH estimator replaces it with a deterministic one: a tensor-product Gauss-Hermite quadrature rule that evaluates the same expectation over at a small set of fixed nodes. This estimator is new in admixr2 — it has no counterpart in admr.
For a single random effect, the probabilists’ Gauss-Hermite rule approximates an expectation under the standard normal by a weighted sum over nodes,
with nodes and weights (normalised so and ) obtained from the Golub-Welsch algorithm — the eigendecomposition of the symmetric tridiagonal Jacobi matrix of the Hermite recurrence, requiring no external tables. For random effects the rule is applied as a tensor product, giving a grid of nodes with product weights. The Cholesky factor () maps the standard-normal nodes into the correlated random-effect space, , so the population moments become
Structurally this is the MC estimator with a fixed deterministic node grid in place of random draws and quadrature weights in place of uniform ones, so it plugs into exactly the same aggregate MVN .
Noise-free objective. Because the nodes and weights
are fixed, the objective is a smooth deterministic function of
— there is no Monte Carlo noise to average away and no
n_sim to tune. The analytical gradient (closed-form
contractions through the same ODE sensitivity outputs used by the MC
estimator) is therefore exact, and the numerical Hessian used for
standard errors is well-conditioned. Like admc it makes no
approximation to
,
so it is unbiased at any IIV magnitude, and its likelihood is on the
same scale — its AIC is directly comparable to admc and
adirmc.
Cost. The grid size
grows exponentially in the number of random effects. For models with up
to ~4 etas,
is small
(e.g.
nodes) and adgh is substantially faster than MC at
equivalent accuracy; for high-dimensional IIV the node count becomes
prohibitive and admc or adirmc are preferable.
A modest node count (n_nodes = 5) gives near-exact
covariance moments for IIV SD up to ~0.5; n_nodes = 7
extends this to ~0.7.
Iterative Reweighting MC (adirmc)
The central bottleneck of the MC estimator is that every NLL evaluation requires ODE solves — one per sample. For a complex model each solve is expensive, and an L-BFGS optimisation may require hundreds of NLL evaluations.
IRMC decouples sample generation from optimisation.
Proposal distribution
At each outer phase, proposals are drawn once from an inflated prior:
where is the current Omega estimate and is the expansion factor. The predictions are computed once and stored. For the remainder of the inner optimisation they are reused without further ODE solves.
Importance weights
When the parameter vector moves to a candidate , the stored predictions are no longer exact. For the random-effects covariance , the proposals are reweighted by the ratio of target density to proposal density:
Weights are normalised via softmax. The weighted mean and covariance of the predictions then provide the IRMC estimates of and :
Given fixed proposals this inner objective is a smooth, deterministic function of and can be optimised to high precision. Between phases, proposals are refreshed and box constraints are progressively tightened to guide global convergence. The number of ODE solves scales with the number of phases, not the number of inner optimizer steps.
Gradient computation: from finite differences to sensitivity equations
The gradient is the single most important ingredient for efficient nonlinear optimisation. The improvements in gradient computation represent the most consequential algorithmic differences between admr and admixr2.
Gradient in admr
admr stores a fixed Sobol base matrix biseq for the
lifetime of an optimisation run. Random-effect samples are formed as
, so the NLL is a smooth
deterministic function of the parameter vector for fixed
biseq. The gradient for the MC estimator is computed by
forward finite differences of this deterministic NLL:
For a model with
parameters, each gradient call therefore costs
extra NLL evaluations,
i.e.
additional ODE solves. Because this is expensive, admr defaults to
gradient-free BOBYQA; L-BFGS with FD gradient is available but off by
default (use_grad = FALSE).
For the IRMC estimator, proposals are fixed inside a closure before the inner optimisation begins, so each inner NLL evaluation requires only matrix operations — no ODE solves. admr’s optional FD inner gradient therefore incurs extra importance-weighted recomputations per step, but still no extra ODE solves. Nonetheless, the inner optimizer defaults to BOBYQA (no gradient) and the FD approximation introduces truncation error of order .
CRN analytical gradient in admixr2
admixr2 fixes the sample set and computes the gradient of the frozen deterministic MC NLL analytically — an application of the Common Random Numbers (CRN) identity.
For a mu-referenced parameterisation , the chain rule gives
i.e. the sample average of the ODE sensitivity output — which is already available from the NLL evaluation via the augmented ODE system. No additional ODE solves are needed.
The gradient of with respect to follows similarly: the Cholesky factor () enters through the sample covariance, and differentiating through the MVN chain rule yields closed-form expressions in the stored sensitivity outputs.
The net result is that the full MC gradient costs the same as a single NLL evaluation — it requires no additional ODE solves and is exact rather than a finite-difference approximation. For a model with parameters and samples, admr’s FD gradient requires additional ODE solves per optimizer step; admixr2’s CRN gradient requires none.
Analytical IRMC inner gradient
The importance-weighted inner objective has a tractable analytical gradient with respect to both and . The gradient passes through the softmax normalisation via the identity
where . Differentiating the multivariate normal log-likelihood of the weighted mean and covariance through this identity yields closed-form expressions in the pre-computed ODE solutions and sensitivity outputs. Since proposals are fixed, the inner gradient evaluation requires no ODE solves regardless of the implementation — the admixr2 advantage over admr’s FD approach is exactness and a reduction in inner-level matrix operations.
The inner optimizer therefore runs L-BFGS with an exact gradient and converges in far fewer iterations than BOBYQA.
Kappa correction for non-mu-referenced parameters
Both admr and admixr2 handle models where some structural parameters enter without a paired random effect (e.g. , but has no ). When such parameters change during the inner IRMC optimisation, the stored predictions are no longer exact even for the mean. Both packages apply a kappa correction:
shifting by the predicted change at .
admixr2 introduces a linearized kappa option
(kappa_method = "linearized") that computes the Jacobian
once per outer iteration (via a single batched FD solve) and
approximates the correction as
during the inner loop — a pure matrix operation requiring no ODE solver
calls. This is particularly valuable for complex ODE systems where each
kappa evaluation is otherwise expensive: with the exact kappa, every
inner NLL step requires one extra rxSolve for the single-beta
parameters; with linearized kappa, the inner loop is completely free of
ODE solver calls.
Parameter space geometry
Cholesky parameterisation of
The between-subject covariance must be positive definite. Unconstrained optimisation in the space of symmetric matrices can produce non-positive-definite steps.
Both admr and admixr2 parameterise through a Cholesky factor (). The packages differ in how the Cholesky entries are encoded. admr uses a log transform for diagonal entries and a covlogit transform for off-diagonal entries (the correlation corresponding to each is mapped through a logit after normalisation). admixr2 uses:
The raw parameterisation simplifies the gradient computation substantially: the Jacobian is a simple rank-1 matrix with no logit derivative to chain through. It also avoids the numerical instability of logit-based transforms when correlations approach ±1.
The specific encoding rather than equalises gradient sensitivity across parameter types: a unit optimizer step changes by a factor of , matching the sensitivity of structural parameters on the log scale.
Parameter preconditioning
The unconstrained parameter vector is pre-scaled by a diagonal matrix before being passed to L-BFGS. Each diagonal entry estimates the curvature in that direction: unity for exponential transforms, a derivative-based magnitude for logit/probit transforms, and for off-diagonal Cholesky entries. Preconditioning reduces condition number and accelerates convergence when structural parameters and variance components differ by orders of magnitude.
Summary of improvements over admr
admr (van de Beek et al., 2025) established the core aggregate data workflow in R. admixr2 preserves the same three estimators (FO, MC, IRMC) and the same aggregate MVN likelihood — adding a fourth, deterministic Gauss-Hermite estimator — while replacing the surrounding architecture and several algorithmic components.
| Aspect | admr | admixr2 |
|---|---|---|
| Ecosystem | Standalone R package | nlmixr2/rxode2 integration |
| Model syntax | Custom genopts() + prediction function |
nlmixr2 ini() / model() blocks |
| Fit object | Plain list | nlmixr2 fit object (AIC, logLik, plot, print) |
| Estimators | FO, MC, IRMC | FO, MC, GH, IRMC |
| Gauss-Hermite quadrature estimator | Not available | New deterministic, noise-free estimator |
| FO Jacobian () | C++ finite differences | ODE sensitivity equations |
| MC gradient | FD of frozen MC NLL ( extra solves) | CRN analytical (0 extra ODE solves) |
| IRMC inner optimizer | BOBYQA (default) or FD gradient | L-BFGS with analytical gradient |
| Linearized kappa | Not available | Optional; eliminates ODE calls in inner loop |
| off-diagonal encoding | covlogit (correlation logit) | Raw Cholesky entry |
| Parameter preconditioning | No | Yes (diagonal scaling) |
| Parallel restarts | Sequential chains |
furrr workers |
The gradient improvements are the most consequential. Moving from a gradient that requires extra ODE solves per step (admr FD) to one that requires none (admixr2 CRN) reduces wall-clock fitting time by one to two orders of magnitude for gradient-based algorithms on complex models. The analytical IRMC inner gradient compounds this saving: the inner optimisation converges in fewer iterations, each iteration is faster, and the linearized kappa option allows the entire inner loop to run without any ODE solver calls.
Further reading
The mathematical foundations are described in detail in the papers
linked at the top of this post. The Estimator comparison vignette
shows worked examples on the included examplomycin dataset
with all four estimators. The Advanced
usage vignette covers gradient modes, multi-restart fitting, and
model comparison via AIC.