Model selection guide

This guide explains how meridian-tools supports Bayesian model selection using Leave-One-Out (LOO) cross-validation and the Watanabe-Akaike Information Criterion (WAIC). It covers when model selection is available, how to interpret the outputs, and how to compare multiple candidate models.

What model selection provides

Bayesian model selection uses information criteria computed from pointwise log-likelihood values to compare model specifications. Unlike predictive accuracy on a held-out set, LOO and WAIC evaluate the model’s expected predictive performance using the full posterior without requiring a separate validation split.
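To make "computed from pointwise log-likelihood values" concrete, the sketch below computes WAIC by hand with NumPy from an array of log-likelihood draws. It is an illustration of the standard WAIC formula on the log scale, not the code path meridian-tools or ArviZ runs; the function name `waic_from_log_likelihood` and the synthetic input are assumptions made for this example.

```python
import numpy as np


def waic_from_log_likelihood(log_lik: np.ndarray) -> dict[str, float]:
    """Compute WAIC on the log scale from an (n_draws, n_obs) array."""
    log_lik = np.asarray(log_lik, dtype=float)
    n_obs = log_lik.shape[1]

    # Log pointwise predictive density: log of the posterior-mean likelihood,
    # shifted by the per-observation maximum so the exponentials cannot overflow.
    shift = log_lik.max(axis=0)
    lppd_i = shift + np.log(np.exp(log_lik - shift).mean(axis=0))

    # Effective-parameter penalty: posterior variance of each log likelihood.
    p_waic_i = log_lik.var(axis=0, ddof=1)

    elpd_i = lppd_i - p_waic_i
    return {
        "elpd_waic": float(elpd_i.sum()),
        "p_waic": float(p_waic_i.sum()),
        "se": float(np.sqrt(n_obs * elpd_i.var())),
    }


# Synthetic input (not real model output): 400 draws for 8 observations.
rng = np.random.default_rng(0)
example = waic_from_log_likelihood(rng.normal(-1.0, 0.2, size=(400, 8)))
print(example)
```

Every quantity is a sum over observations of a per-observation term, which is why the pointwise log likelihood is the only input required.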

meridian-tools wraps ArviZ’s az.loo and az.waic with:

Automatic log-likelihood reconstruction for fitted Meridian models
Structured error handling when model selection is not possible
A compare_models surface for ranking multiple candidates
Artefact-level compatibility status in every run directory

For the broader rationale, see why meridian-tools exists. The short version is that model choice needs out-of-sample evidence, not only in-sample fit summaries.

Compatibility boundary

Model selection is only available for models where holdout_id is None and whose pointwise log likelihood is either stored or can be reconstructed. This means:

Run type	Model selection available
Full-sample fit (no validation)	Yes
Final-fit run (`mode: final_fit`)	Yes
Blocked-tail validation run	No
Rolling-origin validation split	No
Authored-holdout run	No
Bare `InferenceData` without `log_likelihood`	No

This restriction exists because LOO and WAIC require the full observed likelihood surface. A holdout fit has a modified likelihood that does not represent the full data-generating process, so comparing a holdout fit’s ELPD against a full fit’s ELPD would be statistically meaningless.
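One simplified way to see the problem: ELPD is a sum of per-observation terms, so a total taken over fewer observations is on a different scale. The sketch below uses made-up pointwise values and a made-up holdout mask, and it ignores the further issue that a holdout fit also has a different posterior.

```python
import numpy as np

# Hypothetical pointwise ELPD contributions for 8 observations.
elpd_i = np.array([-1.1, -0.9, -1.0, -1.2, -0.8, -1.0, -1.1, -0.9])

# Hypothetical holdout mask: True marks observations withheld from fitting.
holdout = np.array([False, False, False, False, False, False, True, True])

elpd_full = elpd_i.sum()
elpd_train_only = elpd_i[~holdout].sum()

print(f"all 8 observations:      {elpd_full:.1f}")
print(f"6 training observations: {elpd_train_only:.1f}")
```

The second total is higher only because it sums fewer terms, not because the model predicts better.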

How it works in the pipeline

When exports.export_model_selection is set to true in the YAML config, the runner’s 30_model_assessment stage attempts model selection after writing diagnostics.

Compatible runs

For compatible models, the stage writes:

loo_summary.json — LOO summary statistics (ELPD, p_loo, SE, etc.)
waic_summary.json — WAIC summary statistics
loo_pointwise.csv — Per-observation LOO values and Pareto k diagnostics
waic_pointwise.csv — Per-observation WAIC values
model_comparison.csv — Ranked comparison table (single-model for individual runs)
model_selection_warnings.json — Warning category/message/step and result flags when LOO, WAIC, or comparison emits warnings

Unavailable or degraded runs

For incompatible models or unexpected model-selection runtime/export failures, the stage writes a single status artefact:

model_selection_status.json

{
  "status": "unavailable",
  "reason_code": "holdout_fit_unsupported",
  "reason": "Model selection requires holdout_id is None ...",
  "warnings": []
}

Known reason codes:

Code	Meaning
`holdout_fit_unsupported`	The model was fitted with a holdout mask
`requires_fitted_meridian_model`	Missing posterior samples or ArviZ `InferenceData`
`missing_log_likelihood_group`	Bare `InferenceData` without reconstructable likelihood
`meridian_internal_seam_incompatible`	Meridian version lacks required internal reconstruction methods
`arviz_runtime_error`	ArviZ raised an unexpected runtime/value error
`export_runtime_error`	Writing model-selection JSON/CSV artefacts failed

Model-selection unavailability is non-fatal. The pipeline completes successfully and records the reason in the artefact.
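Because the run succeeds either way, downstream code has to check which artefacts were written. The sketch below shows one way to do that with the standard library. The helper name `read_model_selection` is hypothetical, and it assumes that the presence of model_selection_status.json means the LOO and WAIC summaries should not be used.

```python
import json
import tempfile
from pathlib import Path


def read_model_selection(stage_dir: Path) -> dict:
    """Summarise the model-selection outcome stored in one stage directory."""
    status_path = stage_dir / "model_selection_status.json"
    if status_path.exists():
        # Assumption: a status artefact means LOO/WAIC results are not usable.
        status = json.loads(status_path.read_text(encoding="utf-8"))
        return {
            "available": False,
            "reason_code": status.get("reason_code"),
            "reason": status.get("reason"),
        }
    loo_path = stage_dir / "loo_summary.json"
    if loo_path.exists():
        loo = json.loads(loo_path.read_text(encoding="utf-8"))
        return {"available": True, "loo": loo}
    return {"available": False, "reason_code": None, "reason": "no artefacts found"}


# Demonstration against a temporary directory holding the status example above.
with tempfile.TemporaryDirectory() as tmp:
    stage = Path(tmp)
    payload = {
        "status": "unavailable",
        "reason_code": "holdout_fit_unsupported",
        "reason": "Model selection requires holdout_id is None ...",
        "warnings": [],
    }
    (stage / "model_selection_status.json").write_text(
        json.dumps(payload), encoding="utf-8"
    )
    outcome = read_model_selection(stage)

print(outcome["available"], outcome["reason_code"])
```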

Using the Python API directly

Compute LOO for a single model

from meridian_tools.model_selection import compute_loo

result = compute_loo(fitted_model, pointwise=True)

print(result.kind)          # "loo"
print(result.summary)       # {"kind": "loo", "elpd_loo": -123.4, ...}
print(result.pointwise)     # DataFrame with loo_i, pareto_k per observation

Compute WAIC for a single model

from meridian_tools.model_selection import compute_waic

result = compute_waic(fitted_model, pointwise=True)

print(result.kind)          # "waic"
print(result.summary)       # {"kind": "waic", "elpd_waic": -125.1, ...}

Compare multiple models

from meridian_tools.model_selection import compare_models

comparison = compare_models(
    {
        "model_a": fitted_model_a,
        "model_b": fitted_model_b,
    },
    ic="loo",   # or "waic"
)

print(comparison)
# DataFrame with columns: model, rank, elpd_loo, p_loo, elpd_diff, weight, se, dse, warning, scale

The comparison table is ranked by ELPD. The best model has rank 0 and elpd_diff == 0. The weight column gives stacking weights.

Worked comparison

The example below uses two small ArviZ InferenceData objects with log_likelihood groups. A fitted Meridian model follows the same path after meridian-tools reconstructs its pointwise log likelihood.

import arviz as az
import numpy as np

from meridian_tools.model_selection import compare_models


def idata_with_log_likelihood(seed: int) -> az.InferenceData:
    rng = np.random.default_rng(seed)
    return az.from_dict(
        posterior={"theta": rng.normal(size=(2, 200, 1))},
        log_likelihood={"y": rng.normal(loc=-1.0, scale=0.2, size=(2, 200, 8))},
    )


comparison = compare_models(
    {
        "baseline": idata_with_log_likelihood(1),
        "candidate": idata_with_log_likelihood(2),
    },
    ic="loo",
)
print(comparison)

Representative output:

model	rank	elpd_loo	p_loo	elpd_diff	weight	se	dse	warning	scale
baseline	0	-8.16	0.33	0.00	1.00	0.02	0.00	False	log
candidate	1	-8.18	0.32	0.02	0.00	0.03	0.04	False	log

The candidate has a lower estimated ELPD in this example, but the difference (elpd_diff of 0.02) is smaller than its standard error (dse of 0.04), so the comparison does not support a strong preference on predictive grounds. In that situation, prefer the simpler or more interpretable specification, or gather more evidence.
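Reading a difference "relative to dse" can be sketched as a small helper. The function name `describe_difference` and the threshold of 2 standard errors are assumptions chosen for illustration; they are a common rule of thumb, not a policy that meridian-tools applies.

```python
def describe_difference(elpd_diff: float, dse: float, threshold: float = 2.0) -> str:
    """Classify an ELPD difference by its size relative to its standard error."""
    if dse == 0.0:
        # The best model is compared with itself, or no uncertainty was reported.
        return "no difference" if elpd_diff == 0.0 else "undetermined"
    ratio = abs(elpd_diff) / dse
    if ratio >= threshold:
        return "distinguishable"
    return "not clearly distinguishable"


# Values from the candidate row of the representative output above.
print(describe_difference(0.02, 0.04))
```

The threshold is left as a parameter because how many standard errors count as convincing is a judgement call.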

The stored geo-panel demo includes the same output schema in runs/demos/demo-geo-panel_20260424_172854/30_model_assessment/model_comparison.csv. That demo is a single-model run, so its elpd_diff and dse are both zero.

Check log-likelihood availability

from meridian_tools.model_selection import compute_loo, has_log_likelihood

if has_log_likelihood(fitted_model):
    result = compute_loo(fitted_model)

Log-likelihood reconstruction

Meridian does not store pointwise log-likelihood in its InferenceData by default. meridian-tools reconstructs it automatically when you pass a fitted Meridian model to compute_loo, compute_waic, or compare_models.

The reconstruction:

Recovers unsaved posterior parameters (e.g. geo deviations, tau_g)
Rebuilds the joint distribution from the posterior samples
Computes observation-level log-likelihood
Returns a new InferenceData with the log_likelihood group attached

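Calling compute_loo, compute_waic, or compare_models never mutates the original model. The reconstruction produces a temporary copy used only for the ArviZ computation.

To attach the log likelihood yourself, and to choose between a new copy and in-place mutation, call attach_log_likelihood directly: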
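The "observation-level log-likelihood" step can be illustrated with a deliberately simple stand-in. The sketch below evaluates a normal likelihood for every observation under every posterior draw. It is not Meridian's likelihood or reconstruction code; the function name, the normal model, and the synthetic draws are all assumptions made to show the shape of the result.

```python
import numpy as np


def normal_pointwise_log_likelihood(y, mu, sigma):
    """Log density of each observation under each posterior draw.

    y: (n_obs,), mu: (n_chains, n_draws, n_obs), sigma: (n_chains, n_draws).
    Returns an array of shape (n_chains, n_draws, n_obs).
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Add a trailing axis so sigma broadcasts across observations.
    sigma = np.asarray(sigma, dtype=float)[..., np.newaxis]
    z = (y - mu) / sigma
    return -0.5 * z**2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)


# Synthetic data and draws, for illustration only.
rng = np.random.default_rng(0)
y_obs = rng.normal(size=8)
mu_draws = rng.normal(scale=0.1, size=(2, 200, 8))
sigma_draws = np.abs(rng.normal(loc=1.0, scale=0.05, size=(2, 200)))

log_lik = normal_pointwise_log_likelihood(y_obs, mu_draws, sigma_draws)
print(log_lik.shape)
```

The result has one value per chain, draw, and observation, which is the layout an InferenceData log_likelihood group holds.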

from meridian_tools.log_likelihood import attach_log_likelihood

# Returns new InferenceData with log_likelihood group (original unchanged)
idata_with_ll = attach_log_likelihood(fitted_model, in_place=False)

# Mutates the model's inference_data in place
attach_log_likelihood(fitted_model, in_place=True)

Interpreting the outputs

LOO summary

Field	Meaning
`elpd_loo`	Expected log pointwise predictive density (higher is better)
`p_loo`	Effective number of parameters
`se`	Standard error of `elpd_loo`
`warning`	Whether Pareto k diagnostics indicate unreliable estimates

WAIC summary

Field	Meaning
`elpd_waic`	Expected log pointwise predictive density (WAIC estimate)
`p_waic`	Effective number of parameters (WAIC estimate)
`se`	Standard error of `elpd_waic`
`warning`	Whether posterior variance diagnostics indicate unreliable estimates

Pareto k diagnostics

The pointwise LOO output includes a pareto_k column. ArviZ uses Pareto k to diagnose whether the PSIS-LOO approximation is reliable for each observation and sets the summary warning flag when its reliability checks fail. meridian-tools surfaces those values and warnings; it does not currently add its own separate thresholding policy.
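Since no thresholding is applied for you, flagging individual observations is left to the caller. The sketch below is one way to do it. The helper name `flag_high_pareto_k` is hypothetical, and the default of 0.7 is a commonly used cut-off assumed here; ArviZ may apply a different or sample-size-dependent threshold depending on its version.

```python
import numpy as np


def flag_high_pareto_k(pareto_k, threshold: float = 0.7) -> np.ndarray:
    """Return the positions of observations whose Pareto k exceeds the threshold."""
    k = np.asarray(pareto_k, dtype=float)
    return np.flatnonzero(k > threshold)


# Made-up Pareto k values standing in for the pareto_k column.
k_values = [0.12, 0.45, 0.71, 0.30, 1.05]
flagged = flag_high_pareto_k(k_values)
print(flagged.tolist())
```

Flagged observations are the ones to inspect first, for example as outliers or as points the model fits poorly.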

Model comparison

When comparing two or more models:

elpd_diff — Difference in ELPD from the best model (0 for the best)
dse — Standard error of the ELPD difference
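weight — Stacking weight (the model’s share in the weighted combination of candidates with the best expected predictive performance; it is not a posterior model probability)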
Models are ranked by ELPD (rank 0 is best)

A single-model comparison returns a one-row table with rank=0, elpd_diff=0, and weight=1.0.
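The relationship between elpd_diff and dse can be sketched from pointwise ELPD values. The example below follows the usual paired-difference calculation and assumes both models were evaluated on the same observations in the same order. The function name `elpd_difference` and the input values are made up for illustration, and this is not a claim about how meridian-tools computes the table internally.

```python
import numpy as np


def elpd_difference(elpd_i_best, elpd_i_other) -> tuple[float, float]:
    """Return (elpd_diff, dse) for two models from paired pointwise ELPD values."""
    best = np.asarray(elpd_i_best, dtype=float)
    other = np.asarray(elpd_i_other, dtype=float)
    if best.shape != other.shape:
        raise ValueError("both models must cover the same observations")
    diff_i = best - other
    elpd_diff = diff_i.sum()
    # Standard error of a sum of n paired differences.
    dse = np.sqrt(diff_i.size * diff_i.var())
    return float(elpd_diff), float(dse)


# Made-up pointwise values for four observations.
best_i = [-1.0, -1.0, -1.0, -1.0]
other_i = [-1.1, -0.9, -1.2, -1.0]
elpd_diff, dse = elpd_difference(best_i, other_i)
print(round(elpd_diff, 3), round(dse, 3))
```

Because dse is computed from paired differences, it can be much smaller than either model's own se when the two models make similar pointwise predictions.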

Error handling

All model-selection errors are raised as ModelSelectionError with a structured reason_code:

from meridian_tools.model_selection import ModelSelectionError, compute_loo

try:
    result = compute_loo(candidate)
except ModelSelectionError as exc:
    print(exc.reason_code)  # e.g. "holdout_fit_unsupported"
    print(str(exc))         # Human-readable explanation

In the pipeline, these errors are caught and written to model_selection_status.json rather than failing the run.
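The catch-and-record pattern can be sketched as follows. This is not the runner's actual code: the `ModelSelectionError` class below is a local stand-in with an assumed constructor, and `run_selection_stage` is a hypothetical name. Only the field names of the status payload are taken from the example earlier in this guide.

```python
import json
import tempfile
from pathlib import Path


class ModelSelectionError(Exception):
    """Local stand-in for meridian_tools.model_selection.ModelSelectionError."""

    def __init__(self, message: str, reason_code: str) -> None:
        super().__init__(message)
        self.reason_code = reason_code


def run_selection_stage(compute, stage_dir: Path) -> str:
    """Run `compute`; on ModelSelectionError, write a status artefact instead of raising."""
    try:
        compute()
    except ModelSelectionError as exc:
        payload = {
            "status": "unavailable",
            "reason_code": exc.reason_code,
            "reason": str(exc),
            "warnings": [],
        }
        status_path = stage_dir / "model_selection_status.json"
        status_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
        return "unavailable"
    return "ok"


def failing_compute() -> None:
    # Simulates a holdout fit being passed to model selection.
    raise ModelSelectionError(
        "Model selection requires holdout_id is None", "holdout_fit_unsupported"
    )


with tempfile.TemporaryDirectory() as tmp:
    outcome = run_selection_stage(failing_compute, Path(tmp))
    status_file = Path(tmp) / "model_selection_status.json"
    written = json.loads(status_file.read_text(encoding="utf-8"))

print(outcome, written["reason_code"])
```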