Metrics (M1–M8)

The suite reports eight metrics. All eight share one reporting framework (per-workload scores with bootstrap confidence intervals), and each metric targets a different axis. The authoritative formulas are in docs/03-eval-metrics.md.
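
As a rough illustration of the shared reporting shape, the sketch below computes a percentile bootstrap confidence interval for the mean of one workload's scores. It is not the suite's implementation: the function name `bootstrap_ci`, the percentile method, the resample count, and the choice of individual scores as the resampling unit are all assumptions made for this example. docs/03-eval-metrics.md remains the reference for the actual procedure.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of one workload's scores.

    Hypothetical helper for illustration only; the suite's own
    resampling scheme may differ.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(scores)
    # Resample with replacement, record the mean of each resample, then sort.
    resampled_means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(scores), lower, upper


point, lower, upper = bootstrap_ci([0.80, 0.90, 0.70, 0.85, 0.95])
print(f"score = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Fixing the seed makes repeated runs report the same interval, in keeping with the reproducibility emphasis elsewhere in this section.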

| Metric | Measures | Notes |
| --- | --- | --- |
| `M1` | Recall under budget | last-N turns, N derived from `budget_tokens`; the recall baseline. |
| `M2` | (see metrics doc) | - |
| `M3` | Token-budget efficiency | tokenizer pinned to `cl100k_base` for reproducibility. |
| `M4` | (see metrics doc) | - |
| `M5` | (see metrics doc) | - |
| `M6` | Temporal consistency | `as_of`-aware minus `as_of`-naive recall; positive = improvement. |
| `M7` | (see metrics doc) | - |
| `M8` | Conformance (`conformance_rate`) | fraction of applicable conformance assertions that pass; reference store = 1.000. |
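
The two metrics whose arithmetic is spelled out in the table, `M6` and `M8`, can be sketched as follows. This is an illustration of the table's one-line descriptions only, not the canonical formulas. The function names, the argument names, and the convention of marking a non-applicable assertion with `None` are assumptions made for this example.

```python
def temporal_consistency(recall_as_of_aware, recall_as_of_naive):
    """M6 sketch: as_of-aware recall minus as_of-naive recall.

    A positive result means as_of-aware retrieval improved recall.
    """
    return recall_as_of_aware - recall_as_of_naive


def conformance_rate(assertion_results):
    """M8 sketch: fraction of applicable conformance assertions that pass.

    Assumed encoding: True = pass, False = fail, None = not applicable.
    """
    applicable = [r for r in assertion_results if r is not None]
    if not applicable:
        raise ValueError("no applicable conformance assertions")
    return sum(applicable) / len(applicable)


# Two passes, one failure, one non-applicable assertion.
print(f"{conformance_rate([True, True, None, False]):.3f}")
# Every applicable assertion passes, as for the reference store.
print(f"{conformance_rate([True, None, True]):.3f}")
```

Excluding non-applicable assertions from the denominator is what lets a store that implements only part of the surface still score 1.000 on the assertions that apply to it.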

Note: The M2, M4, M5, and M7 definitions are intentionally not reproduced here, to avoid drift. Pull them from docs/03-eval-metrics.md, which is canonical.