Design decisions

This document records the key design decisions in meridian-tools and the reasoning behind them. It is intended for maintainers and contributors who need to understand why things are built the way they are.

No IID cross-validation

Decision: meridian-tools does not implement random-shuffle or naive k-fold cross-validation.

Reasoning: MMM data is time-series data. Random IID splits ignore temporal order, so future observations end up in the training set while earlier observations are held out for testing. This leakage produces optimistically biased accuracy estimates that do not reflect real-world forecasting performance.

The package provides two time-respecting alternatives:

  • Blocked tail — reserves the most recent observations as a single test block.
  • Rolling origin — expanding-window forward-chaining that respects temporal ordering at every split.
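As a rough illustration of the blocked-tail idea, the sketch below splits integer positions so that the most recent observations form the test block. The function name blocked_tail_split and its parameters are hypothetical and are not taken from the meridian-tools API.

```python
from typing import List, Tuple


def blocked_tail_split(n_obs: int, test_size: int) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx) with the last `test_size` points held out.

    Hypothetical helper for illustration only.
    """
    if not 0 < test_size < n_obs:
        raise ValueError("test_size must be between 1 and n_obs - 1")
    cutoff = n_obs - test_size
    # Everything before the cutoff trains the model; everything after tests it.
    return list(range(cutoff)), list(range(cutoff, n_obs))


train_idx, test_idx = blocked_tail_split(n_obs=10, test_size=3)
print(train_idx)  # [0, 1, 2, 3, 4, 5, 6]
print(test_idx)   # [7, 8, 9]
```

Every training index precedes every test index, which is the property a random shuffle cannot guarantee.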

Non-overlapping rolling-origin test windows

Decision: step_size must equal test_size for rolling-origin splits.

Reasoning: Overlapping test windows would place the same observation in multiple test sets. The per-split validation scores would then be correlated by construction, which undermines comparisons across splits and complicates the interpretation of aggregate metrics. With non-overlapping windows, each observation is scored at most once across the split plan; observations that fall only in the initial training window are never scored.
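The constraint can be sketched with a small expanding-window generator. This is an assumption-laden illustration: rolling_origin_splits and initial_train_size are hypothetical names, and the real build_rolling_origin_splits may differ in signature and behaviour.

```python
from typing import Iterator, List, Tuple


def rolling_origin_splits(
    n_obs: int, initial_train_size: int, test_size: int, step_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield expanding-window (train_idx, test_idx) pairs (hypothetical sketch)."""
    if step_size != test_size:
        raise ValueError("step_size must equal test_size (non-overlapping windows)")
    if initial_train_size < 1 or test_size < 1:
        raise ValueError("initial_train_size and test_size must be positive")
    train_end = initial_train_size
    while train_end + test_size <= n_obs:
        # Train on everything before the origin, test on the next block.
        yield list(range(train_end)), list(range(train_end, train_end + test_size))
        train_end += step_size  # advancing by test_size keeps windows disjoint


splits = list(
    rolling_origin_splits(n_obs=12, initial_train_size=6, test_size=2, step_size=2)
)
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx)
```

With these inputs the test windows are [6, 7], [8, 9], and [10, 11], so no observation is scored twice.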

Minimum two splits for rolling origin

Decision: build_rolling_origin_splits requires at least two splits.

Reasoning: A single rolling-origin split is functionally identical to a blocked-tail holdout and provides no comparative signal. If your data only supports one split, use blocked_tail instead — it communicates the intent more clearly.

Holdout restriction for model selection

Decision: LOO and WAIC are only available for models where holdout_id is None.

Reasoning: LOO and WAIC estimate expected log predictive density (ELPD) from the pointwise log-likelihood of every observation. A model fitted with a holdout mask has a likelihood that excludes the held-out observations, so LOO or WAIC computed from it would cover fewer observations and produce ELPD estimates that are not comparable to those from full-sample fits.

The correct workflow is:

  1. Use validation splits for candidate evaluation.
  2. Select the best specification based on holdout performance.
  3. Refit the chosen specification on the full dataset.
  4. Compute LOO/WAIC on the full-sample fit for model quality reporting.

Separation of validation fits and final fits

Decision: Validation runs and final production fits are separate pipeline executions that produce separate run directories.

Reasoning: A validation fit is trained on a subset of the data. Its posterior reflects that subset and should not be used as the production artefact. Keeping them as separate runs prevents accidental use of a validation fit for downstream analysis or reporting.

Lazy imports for CLI responsiveness

Decision: Heavy dependencies (TensorFlow, NumPy, Meridian, ArviZ) are not imported at module level in the config, CLI, or validation layers.

Reasoning: TensorFlow alone takes several seconds to import. The CLI should respond to --help and --list without paying that cost. The __init__.py uses __getattr__-based lazy loading, and the test suite verifies that build_parser() loads only pydantic and yaml.
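The general pattern behind __getattr__-based lazy loading (PEP 562) can be sketched as follows. This is not the package's actual __init__.py: the mapping, the helper name, and the demo module are hypothetical, and the standard-library modules json and decimal stand in for heavy dependencies.

```python
import importlib
import types

# Hypothetical mapping from a public attribute name to the module defining it.
# `json` and `decimal` stand in for heavy dependencies such as TensorFlow.
_LAZY_ATTRS = {"dumps": "json", "Decimal": "decimal"}


def _lazy_getattr(name: str):
    """Module-level __getattr__ (PEP 562): import only on first access."""
    try:
        module_name = _LAZY_ATTRS[name]
    except KeyError:
        raise AttributeError(f"module has no attribute {name!r}") from None
    return getattr(importlib.import_module(module_name), name)


# In a real package this function would be defined as __getattr__ in
# __init__.py. Here it is attached to a throwaway module object so the
# sketch runs standalone.
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__getattr__ = _lazy_getattr

print(demo_pkg.dumps({"a": 1}))  # the import of json happens at this access
```

Because the import is deferred to attribute access, building an argument parser or printing help text does not trigger it.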

Pydantic extra="forbid" everywhere

Decision: All configuration models reject unexpected keys.

Reasoning: Silent acceptance of unknown keys is a common source of misconfiguration in YAML-driven tools. A typo like export_pridictive_accuracy would be silently ignored without extra="forbid", leading to unexpected default behaviour. Strict rejection catches these errors at config load time with clear error messages.
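A minimal sketch of the behaviour, assuming Pydantic v2. The ExportConfig model and the correctly spelled field export_predictive_accuracy are hypothetical, inferred from the typo example above rather than copied from the package.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class ExportConfig(BaseModel):
    """Hypothetical config section; only the strictness setting matters here."""

    model_config = ConfigDict(extra="forbid")  # reject unknown keys

    export_predictive_accuracy: bool = False


def load_export_config(raw: dict) -> ExportConfig:
    """Validate a parsed YAML mapping (passed in as a plain dict)."""
    return ExportConfig(**raw)


try:
    load_export_config({"export_pridictive_accuracy": True})  # misspelled key
except ValidationError as exc:
    print(exc)  # the message names the offending key
```

Without extra="forbid", the misspelled key would be dropped and export_predictive_accuracy would silently keep its default of False.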

Relative artefact paths in manifests

Decision: All artefact paths in run_manifest.json are stored relative to the run directory.

Reasoning: Absolute paths would tie run directories to a specific machine or filesystem layout. Relative paths make run directories portable — they can be copied, archived, or moved between machines without breaking the manifest contract.
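One way to picture the portability argument, using only pathlib. The helper names and directory layout are hypothetical; the manifest's actual schema is not shown here.

```python
from pathlib import Path


def to_manifest_path(artefact: Path, run_dir: Path) -> str:
    """Express an artefact path relative to the run directory, POSIX-style.

    Raises ValueError if the artefact lies outside the run directory.
    """
    return artefact.relative_to(run_dir).as_posix()


def from_manifest_path(entry: str, run_dir: Path) -> Path:
    """Resolve a manifest entry against wherever the run directory now lives."""
    return run_dir / entry


entry = to_manifest_path(Path("/data/runs/run_001/plots/fit.png"), Path("/data/runs/run_001"))
print(entry)  # plots/fit.png

# After the run directory is copied elsewhere, the same entry still resolves.
print(from_manifest_path(entry, Path("/archive/run_001")))
```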

Non-destructive lifecycle operations

Decision: refresh_run creates a new sibling directory rather than overwriting the source.

Reasoning: Overwriting a validated production run would destroy the audit trail. Creating a sibling preserves the original for comparison and rollback. The lifecycle layer explicitly validates that source directories are not mutated by refresh operations.
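The sibling-directory idea might look roughly like this. The naming scheme and the helper new_sibling_dir are assumptions made for illustration; refresh_run itself may name and populate directories differently.

```python
import tempfile
from pathlib import Path


def new_sibling_dir(source: Path, suffix: str = "refresh") -> Path:
    """Create an empty sibling of `source` without touching `source` itself.

    Hypothetical naming scheme: <source name>_<suffix>_<counter>.
    """
    if not source.is_dir():
        raise FileNotFoundError(source)
    counter = 1
    while True:
        candidate = source.parent / f"{source.name}_{suffix}_{counter:02d}"
        try:
            candidate.mkdir()  # raises FileExistsError if the name is taken
            return candidate
        except FileExistsError:
            counter += 1


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "run_001"
    source.mkdir()
    (source / "run_manifest.json").write_text("{}")
    first_name = new_sibling_dir(source).name
    second_name = new_sibling_dir(source).name
    source_contents = sorted(p.name for p in source.iterdir())

print(first_name, second_name)  # run_001_refresh_01 run_001_refresh_02
print(source_contents)          # ['run_manifest.json']
```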

Manifest-per-stage persistence

Decision: The manifest is written to disk after each stage completes, not only at the end of the pipeline.

Reasoning: MCMC sampling can run for minutes to hours. If the process crashes or is killed during a later stage, the partial manifest on disk reflects what completed successfully. This aids debugging and allows partial runs to be inspected without special tooling.
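A simplified sketch of per-stage persistence follows. The stage names, the manifest fields, and the write-then-rename step are assumptions; only the file name run_manifest.json comes from this document.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def write_manifest(run_dir: Path, manifest: dict) -> None:
    """Write to a temp file, then rename, so readers never see a partial file."""
    target = run_dir / "run_manifest.json"
    tmp = run_dir / "run_manifest.json.tmp"
    tmp.write_text(json.dumps(manifest, indent=2))
    os.replace(tmp, target)


def run_stages(run_dir: Path, stages: List[Tuple[str, Callable[[], None]]]) -> None:
    manifest = {"completed_stages": []}
    for name, stage in stages:
        stage()  # may raise; the manifest on disk then stops at the prior stage
        manifest["completed_stages"].append(name)
        write_manifest(run_dir, manifest)


def failing_stage() -> None:
    raise RuntimeError("simulated crash during sampling")


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    try:
        run_stages(run_dir, [("00_setup", lambda: None), ("10_fit", failing_stage)])
    except RuntimeError:
        pass
    on_disk = json.loads((run_dir / "run_manifest.json").read_text())

print(on_disk)  # {'completed_stages': ['00_setup']}
```

The manifest left behind lists only the stage that finished, which is what makes a crashed run inspectable.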

Stage numbering with gaps

Decision: Pipeline stages are numbered in steps of ten (00, 10, 20, 30, 40, 60, 70), with 50 left unused.

Reasoning: The gaps allow future stages to be inserted at natural positions (e.g. a stage 50 for custom analysis) without renumbering existing stages. Renumbering would break backward compatibility with stored manifests and any downstream tooling that references stage names.

Config source vs. resolved archival

Decision: Both the verbatim source YAML and the resolved YAML are archived in every run directory.

Reasoning: The source YAML shows what the analyst authored (including relative paths). The resolved YAML shows the runtime interpretation (absolute paths, defaults applied). Both are needed for reproducibility:

  • The source is needed to understand intent.
  • The resolved config is needed to reproduce the exact execution.

Runtime-only fields (output_dir, run_name, validation_spec) are deliberately excluded from the resolved config because they are not part of the reproducible model specification.
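The exclusion of runtime-only fields could be sketched as a simple filter. This assumes the three fields are top-level keys of the resolved config, which this document does not state; the helper name and the other keys are hypothetical.

```python
# Field names taken from the paragraph above; treating them as top-level
# keys is an assumption made for this sketch.
RUNTIME_ONLY_FIELDS = frozenset({"output_dir", "run_name", "validation_spec"})


def resolved_for_archive(resolved: dict) -> dict:
    """Drop runtime-only fields before the resolved config is archived."""
    return {k: v for k, v in resolved.items() if k not in RUNTIME_ONLY_FIELDS}


resolved = {
    "data_path": "/abs/path/to/data.csv",  # hypothetical key
    "n_chains": 4,                         # hypothetical key
    "output_dir": "/abs/path/to/runs",
    "run_name": "weekly_refresh",
}
print(resolved_for_archive(resolved))
# {'data_path': '/abs/path/to/data.csv', 'n_chains': 4}
```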

Structured model selection errors

Decision: Model selection failures produce ModelSelectionError with a machine-readable reason_code rather than generic exceptions.

Reasoning: The pipeline needs to distinguish between “model selection is not possible for this run type” (expected) and “something is broken” (unexpected). Structured reason codes allow:

  • The runner to write model_selection_status.json without failing the run.
  • The lifecycle layer to compare model selection availability across runs.
  • Downstream consumers to programmatically handle different failure modes.
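To make the distinction concrete, here is a minimal stand-in for a structured error and a caller that records the outcome instead of failing. The constructor signature, the reason code "holdout_present", and the status fields are assumptions; the real ModelSelectionError and model_selection_status.json layout may differ.

```python
import json
from typing import Optional


class ModelSelectionError(Exception):
    """Stand-in for the package's error type, carrying a machine-readable code."""

    def __init__(self, reason_code: str, message: str) -> None:
        super().__init__(message)
        self.reason_code = reason_code


def compute_model_selection(holdout_id: Optional[object]) -> dict:
    """Placeholder for LOO/WAIC; only the holdout guard is sketched here."""
    if holdout_id is not None:
        raise ModelSelectionError(
            "holdout_present",  # hypothetical reason code
            "LOO/WAIC require a full-sample fit (holdout_id must be None)",
        )
    return {"computed": True}


def model_selection_status(holdout_id: Optional[object]) -> dict:
    """Turn an expected failure into a status record rather than a crash."""
    try:
        compute_model_selection(holdout_id)
    except ModelSelectionError as exc:
        return {"available": False, "reason_code": exc.reason_code, "message": str(exc)}
    return {"available": True, "reason_code": None, "message": None}


print(json.dumps(model_selection_status(holdout_id=[0, 0, 1]), indent=2))
```

Unexpected exceptions are deliberately not caught, so a genuine bug still fails the run while an expected unavailability is only recorded.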