Design decisions
This document records the key design decisions in meridian-tools and the
reasoning behind them. It is intended for maintainers and contributors who
need to understand why things are built the way they are.
No IID cross-validation
Decision: meridian-tools does not implement random-shuffle or naive k-fold
cross-validation.
Reasoning: MMM data is a time series. Random IID splits ignore temporal ordering, so the model is trained on observations that postdate some of the test points. This leaks future information into training and produces optimistic accuracy estimates that do not reflect real-world forecasting performance.
The package provides two time-respecting alternatives:
- Blocked tail — reserves the most recent observations as a single test block.
- Rolling origin — expanding-window forward-chaining that respects temporal ordering at every split.
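The two strategies can be sketched with plain index ranges. This is a minimal illustration, not the package's implementation: the function names `blocked_tail_indices` and `rolling_origin_indices` and the `initial_train` parameter are hypothetical.

```python
from typing import List, Tuple

Split = Tuple[range, range]  # (train indices, test indices)


def blocked_tail_indices(n_obs: int, test_size: int) -> Split:
    """Reserve the most recent `test_size` observations as one test block."""
    if not 0 < test_size < n_obs:
        raise ValueError("test_size must be between 1 and n_obs - 1")
    cut = n_obs - test_size
    return range(0, cut), range(cut, n_obs)


def rolling_origin_indices(n_obs: int, initial_train: int, test_size: int) -> List[Split]:
    """Expanding-window splits: every train set ends where its test block starts."""
    if initial_train < 1 or test_size < 1:
        raise ValueError("initial_train and test_size must be positive")
    splits = []
    train_end = initial_train
    while train_end + test_size <= n_obs:
        test = range(train_end, train_end + test_size)
        splits.append((range(0, train_end), test))
        train_end += test_size  # the next origin starts after this test block
    return splits
```

In both cases every training index precedes every test index, which is the property random shuffling destroys.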
Non-overlapping rolling-origin test windows
Decision: step_size must equal test_size for rolling-origin splits.
Reasoning: Overlapping test windows would mean the same observation appears in multiple test sets. This violates the independence assumption needed for comparing validation scores across splits and complicates the interpretation of aggregate metrics. Non-overlapping windows ensure each test observation is evaluated exactly once across the split plan.
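The effect of the constraint can be checked by counting how often each observation lands in a test window. The helpers below are hypothetical and exist only for this illustration; only `step_size` and `test_size` correspond to names used above.

```python
from collections import Counter
from typing import List


def make_test_windows(n_obs: int, initial_train: int, test_size: int, step_size: int) -> List[range]:
    """Test windows of a rolling-origin plan, advancing the origin by step_size."""
    windows = []
    start = initial_train
    while start + test_size <= n_obs:
        windows.append(range(start, start + test_size))
        start += step_size
    return windows


def evaluation_counts(windows: List[range]) -> Counter:
    """How many test windows contain each observation index."""
    return Counter(i for window in windows for i in window)


# step_size < test_size: some observations are scored more than once.
overlapping = evaluation_counts(make_test_windows(12, 4, test_size=4, step_size=2))

# step_size == test_size: every test observation is scored exactly once.
disjoint = evaluation_counts(make_test_windows(12, 4, test_size=4, step_size=4))
```

With `step_size=2` the middle observations are counted twice, so an average over splits would weight them more heavily than the rest.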
Minimum two splits for rolling origin
Decision: build_rolling_origin_splits requires at least two splits.
Reasoning: A single rolling-origin split is functionally identical to a
blocked-tail holdout and provides no comparative signal. If your data only
supports one split, use blocked_tail instead — it communicates the intent
more clearly.
Holdout restriction for model selection
Decision: LOO and WAIC are only available for models where
holdout_id is None.
Reasoning: LOO and WAIC estimate expected log predictive density (ELPD) using the full observed likelihood surface. A model fitted with a holdout mask has a modified likelihood that excludes held-out observations. Computing LOO on this truncated likelihood would produce ELPD estimates that are not comparable to those from full-sample fits.
The correct workflow is:
- Use validation splits for candidate evaluation.
- Select the best specification based on holdout performance.
- Refit the chosen specification on the full dataset.
- Compute LOO/WAIC on the full-sample fit for model quality reporting.
Separation of validation fits and final fits
Decision: Validation runs and final production fits are separate pipeline executions that produce separate run directories.
Reasoning: A validation fit is trained on a subset of the data. Its posterior reflects that subset and should not be used as the production artefact. Keeping them as separate runs prevents accidental use of a validation fit for downstream analysis or reporting.
Lazy imports for CLI responsiveness
Decision: Heavy dependencies (TensorFlow, NumPy, Meridian, ArviZ) are not imported at module level in the config, CLI, or validation layers.
Reasoning: TensorFlow alone takes several seconds to import. The CLI must
respond instantly for --help and --list operations. The __init__.py uses
__getattr__-based lazy loading, and the test suite verifies that
build_parser() only loads pydantic and yaml.
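The `__getattr__` hook referred to here is the module-level one from PEP 562. The sketch below shows the mechanism on a throwaway module object so that it runs on its own. The mapping and the `make_lazy_package` helper are hypothetical, and a standard-library module stands in for a heavy dependency; in a real package the hook would be a top-level `def __getattr__(name)` in `__init__.py`.

```python
import importlib
import types


def make_lazy_package(name: str, lazy_map: dict) -> types.ModuleType:
    """Build a module whose listed attributes are imported on first access."""
    pkg = types.ModuleType(name)

    def _lazy_getattr(attr: str):
        # Only called when normal attribute lookup on the module fails.
        if attr not in lazy_map:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        module = importlib.import_module(lazy_map[attr])
        setattr(pkg, attr, module)  # cache so the hook is not hit again
        return module

    # Equivalent to defining __getattr__ at the top level of __init__.py.
    pkg.__getattr__ = _lazy_getattr
    return pkg


# "fractions" stands in for a slow import such as TensorFlow.
demo_pkg = make_lazy_package("demo_pkg", {"fractions_mod": "fractions"})
loaded_before = "fractions_mod" in vars(demo_pkg)
half = demo_pkg.fractions_mod.Fraction(1, 2)  # the import happens here
loaded_after = "fractions_mod" in vars(demo_pkg)
```

Nothing is imported when the package object is created, which is what keeps operations such as --help fast.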
Pydantic extra="forbid" everywhere
Decision: All configuration models reject unexpected keys.
Reasoning: Silent acceptance of unknown keys is a common source of
misconfiguration in YAML-driven tools. A typo like export_pridictive_accuracy
would be silently ignored without extra="forbid", leading to unexpected
default behaviour. Strict rejection catches these errors at config load time
with clear error messages.
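A minimal sketch of the behaviour, assuming a hypothetical `ExportConfig` model with a single `export_predictive_accuracy` field (the real configuration models are not reproduced here):

```python
from pydantic import BaseModel, ValidationError


class ExportConfig(BaseModel, extra="forbid"):
    """Hypothetical config section; unknown keys are rejected."""

    export_predictive_accuracy: bool = False


def try_load(raw: dict):
    """Return (config, None) on success or (None, error text) on failure."""
    try:
        return ExportConfig(**raw), None
    except ValidationError as exc:
        return None, str(exc)


# The misspelled key is reported instead of being silently dropped.
config, error = try_load({"export_pridictive_accuracy": True})
```

Without `extra="forbid"`, the same input would load successfully and `export_predictive_accuracy` would keep its default of `False`.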
Relative artefact paths in manifests
Decision: All artefact paths in run_manifest.json are stored relative to
the run directory.
Reasoning: Absolute paths would tie run directories to a specific machine or filesystem layout. Relative paths make run directories portable — they can be copied, archived, or moved between machines without breaking the manifest contract.
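As an illustration of the portability argument, the sketch below converts an artefact path to a manifest entry and resolves it again under a different root. The helper names and the example file layout are hypothetical.

```python
from pathlib import Path


def to_manifest_entry(artefact: Path, run_dir: Path) -> str:
    """Store an artefact path relative to the run directory, POSIX-style."""
    # relative_to raises ValueError if the artefact lies outside run_dir.
    return artefact.resolve().relative_to(run_dir.resolve()).as_posix()


def resolve_manifest_entry(entry: str, run_dir: Path) -> Path:
    """Rebuild the artefact location from wherever the run directory now lives."""
    return run_dir / entry


original_run = Path("/tmp/meridian_demo/run_a")
entry = to_manifest_entry(original_run / "stage_20" / "fit.nc", original_run)

# After copying the run directory elsewhere, the same entry still resolves.
moved = resolve_manifest_entry(entry, Path("/mnt/archive/run_a"))
```

The manifest entry never mentions the original root, so nothing in it needs rewriting after a move.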
Non-destructive lifecycle operations
Decision: refresh_run creates a new sibling directory rather than
overwriting the source.
Reasoning: Overwriting a validated production run would destroy the audit trail. Creating a sibling preserves the original for comparison and rollback. The lifecycle layer explicitly validates that source directories are not mutated by refresh operations.
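One way to check the non-mutation guarantee is to snapshot the source directory before and after the operation. The sketch below is an assumption about how such a check could look, not the code behind refresh_run; the helper names and the `_refresh` suffix are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(directory: Path) -> dict:
    """Map each file's relative path to the SHA-256 digest of its contents."""
    return {
        p.relative_to(directory).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(directory.rglob("*"))
        if p.is_file()
    }


def new_sibling_dir(source: Path, suffix: str) -> Path:
    """Create an empty directory next to `source`; fail if it already exists."""
    target = source.with_name(f"{source.name}_{suffix}")
    target.mkdir(exist_ok=False)
    return target


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "run_a"
    source.mkdir()
    (source / "run_manifest.json").write_text("{}")

    before = snapshot(source)
    target = new_sibling_dir(source, "refresh")
    (target / "run_manifest.json").write_text('{"refreshed": true}')

    source_untouched = snapshot(source) == before
    is_sibling = target.parent == source.parent
```

Using `exist_ok=False` means a second refresh with the same suffix fails loudly rather than overwriting an earlier result.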
Manifest-per-stage persistence
Decision: The manifest is written to disk after each stage completes, not only at the end of the pipeline.
Reasoning: MCMC sampling can run for minutes to hours. If the process crashes or is killed during a later stage, the partial manifest on disk reflects what completed successfully. This aids debugging and allows partial runs to be inspected without special tooling.
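The pattern can be sketched as a loop that rewrites the manifest after every stage. The manifest schema (`completed_stages`) and the function names are hypothetical, and the write-then-rename step is an added assumption to avoid a half-written file, not something the project is known to do.

```python
import json
import os
import tempfile
from pathlib import Path


def write_manifest(run_dir: Path, manifest: dict) -> None:
    """Write the manifest to a temp file, then atomically swap it into place."""
    path = run_dir / "run_manifest.json"
    tmp_path = path.with_name(path.name + ".tmp")
    tmp_path.write_text(json.dumps(manifest, indent=2))
    os.replace(tmp_path, path)


def run_pipeline(run_dir: Path, stages) -> None:
    """Run (name, callable) stages in order, persisting progress after each."""
    manifest = {"completed_stages": []}
    for name, stage in stages:
        stage()
        manifest["completed_stages"].append(name)
        write_manifest(run_dir, manifest)


def crash():
    raise RuntimeError("simulated crash during a later stage")


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    try:
        run_pipeline(run_dir, [("00", lambda: None), ("10", lambda: None), ("20", crash)])
    except RuntimeError:
        pass
    # The manifest on disk still records the stages that finished.
    on_disk = json.loads((run_dir / "run_manifest.json").read_text())
```

After the simulated crash, the manifest lists stages 00 and 10, which is enough to see where the run stopped.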
Stage numbering with gaps
Decision: Pipeline stages use numbers 00, 10, 20, 30, 40, 60, 70 with a gap at 50.
Reasoning: Stage numbers advance in steps of ten, and 50 is deliberately left unused, so future stages can be inserted at natural positions (e.g. a stage 50 for custom analysis) without renumbering existing stages. Renumbering would break backward compatibility with stored manifests and any downstream tooling that references stage names.
Config source vs. resolved archival
Decision: Both the verbatim source YAML and the resolved YAML are archived in every run directory.
Reasoning: The source YAML shows what the analyst authored (including relative paths). The resolved YAML shows the runtime interpretation (absolute paths, defaults applied). Both are needed for reproducibility:
- The source is needed to understand intent.
- The resolved config is needed to reproduce the exact execution.
Runtime-only fields (output_dir, run_name, validation_spec) are
deliberately excluded from the resolved config because they are not part of
the reproducible model specification.
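The exclusion can be illustrated as a filter applied before the resolved config is archived. This sketch assumes the three fields are top-level keys of a plain dictionary, which may not match the real config structure; the function name and the example values are hypothetical.

```python
RUNTIME_ONLY_FIELDS = ("output_dir", "run_name", "validation_spec")


def resolved_config_for_archive(resolved: dict) -> dict:
    """Drop runtime-only keys, keeping the reproducible model specification."""
    return {key: value for key, value in resolved.items() if key not in RUNTIME_ONLY_FIELDS}


# Hypothetical resolved config: absolute paths and defaults already applied.
resolved = {
    "data_path": "/abs/path/to/data.csv",
    "n_chains": 4,
    "output_dir": "/abs/path/to/runs",
    "run_name": "weekly_refresh",
    "validation_spec": None,
}
archived = resolved_config_for_archive(resolved)
```

Two runs of the same specification written to different output directories then archive identical resolved configs.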
Structured model selection errors
Decision: Model selection failures produce ModelSelectionError with a
machine-readable reason_code rather than generic exceptions.
Reasoning: The pipeline needs to distinguish between “model selection is not possible for this run type” (expected) and “something is broken” (unexpected). Structured reason codes allow:
- The runner to write model_selection_status.json without failing the run.
- The lifecycle layer to compare model selection availability across runs.
- Downstream consumers to programmatically handle different failure modes.
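A minimal sketch of how a structured error can be turned into a status record instead of a failed run. The constructor signature, the `holdout_fit` reason code, and the status fields are assumptions made for illustration; only the names ModelSelectionError, reason_code, and holdout_id come from this document.

```python
import json


class ModelSelectionError(Exception):
    """Expected model selection failure carrying a machine-readable reason."""

    def __init__(self, reason_code: str, message: str):
        super().__init__(message)
        self.reason_code = reason_code


def compute_model_selection(holdout_id):
    """Stand-in for the LOO/WAIC step; refuses fits that used a holdout mask."""
    if holdout_id is not None:
        raise ModelSelectionError(
            "holdout_fit",  # hypothetical reason code
            "LOO/WAIC require a full-sample fit (holdout_id must be None)",
        )
    return {"status": "available"}


def model_selection_status(holdout_id) -> dict:
    """Convert expected failures into a status record; let other errors propagate."""
    try:
        return compute_model_selection(holdout_id)
    except ModelSelectionError as err:
        return {"status": "unavailable", "reason_code": err.reason_code, "detail": str(err)}


# Content a runner could write to model_selection_status.json.
status_json = json.dumps(model_selection_status(holdout_id="split_0"), indent=2)
```

Because only ModelSelectionError is caught, an unexpected exception still stops the run, which keeps "not possible for this run type" separate from "something is broken".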