Metrics (M1–M8)
The suite reports eight metrics. The framework is uniform (every metric is reported as per-workload scores with bootstrap CIs), while each metric targets a different axis. The authoritative formulas are in docs/03-eval-metrics.md.
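To make "per-workload scores with bootstrap CIs" concrete, here is a minimal sketch of a percentile bootstrap over the per-item scores of one workload. The function name `bootstrap_ci`, the resampling unit, the number of resamples, and the 95% level are all assumptions made for illustration; the suite's actual procedure is whatever docs/03-eval-metrics.md specifies.

```python
import random
import statistics


def bootstrap_ci(item_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap.

    Hypothetical helper: resamples the per-item scores of a single
    workload with replacement and takes percentiles of the resampled means.
    """
    if not item_scores:
        raise ValueError("item_scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = len(item_scores)
    means = sorted(
        statistics.fmean(rng.choices(item_scores, k=n))
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return statistics.fmean(item_scores), means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    point, lower, upper = bootstrap_ci([0.8, 0.9, 0.7, 0.85, 0.95])
    print(f"score={point:.3f} CI=[{lower:.3f}, {upper:.3f}]")
```

Fixing the seed keeps the interval reproducible across runs, which matches the reproducibility concern the table raises for the tokenizer.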
| Metric | Measures | Notes |
|---|---|---|
| M1 | Recall under budget | last-N turns, N derived from budget_tokens; the recall baseline. |
| M2 | (see metrics doc) | - |
| M3 | Token-budget efficiency | tokenizer pinned to cl100k_base for reproducibility. |
| M4 | (see metrics doc) | - |
| M5 | (see metrics doc) | - |
| M6 | Temporal consistency | as_of-aware minus as_of-naive recall; positive = improvement. |
| M7 | (see metrics doc) | - |
| M8 | Conformance (conformance_rate) | fraction of applicable conformance assertions that pass; reference store = 1.000. |
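The M6 and M8 rows describe simple arithmetic, which can be sketched as below. This only restates the one-line summaries in the table; the function names, and the use of `None` to mark a non-applicable assertion, are assumptions for illustration and not the suite's API.

```python
def temporal_consistency(recall_as_of_aware, recall_as_of_naive):
    """M6 as summarized in the table: aware minus naive recall.

    A positive result means as_of-aware retrieval improved recall.
    """
    return recall_as_of_aware - recall_as_of_naive


def conformance_rate(results):
    """M8 as summarized in the table: passed / applicable assertions.

    `results` holds True (pass), False (fail), or None (not applicable;
    an assumed encoding). Non-applicable assertions are excluded from
    the denominator.
    """
    applicable = [r for r in results if r is not None]
    if not applicable:
        raise ValueError("no applicable conformance assertions")
    return sum(applicable) / len(applicable)


if __name__ == "__main__":
    print(temporal_consistency(0.75, 0.5))
    print(f"{conformance_rate([True, True, None, False]):.3f}")
```

Under this reading, a store that passes every applicable assertion scores 1.000, consistent with the reference-store value given in the table.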
Note: The M2/M4/M5/M7 definitions are intentionally not reproduced here to avoid drift. Pull them from docs/03-eval-metrics.md, which is canonical.
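For M1, the table says only that N is "derived from budget_tokens". One plausible reading, sketched here as an assumption, is to keep the longest suffix of the conversation whose token total fits the budget. The helper `last_n_within_budget` is hypothetical, and the default whitespace token counter is a stand-in chosen so the example runs without dependencies; it is not the pinned cl100k_base tokenizer.

```python
def last_n_within_budget(turns, budget_tokens, count_tokens=None):
    """Return the last N turns whose combined token count fits the budget.

    Assumed derivation of N: walk backwards from the newest turn and stop
    at the first turn that would exceed `budget_tokens`.
    """
    if count_tokens is None:
        # Stand-in tokenizer (assumption): whitespace-separated words.
        count_tokens = lambda text: len(text.split())
    used = 0
    n = 0
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget_tokens:
            break
        used += cost
        n += 1
    # Slice from the end so the original chronological order is preserved.
    return turns[len(turns) - n:]


if __name__ == "__main__":
    history = ["a b c", "d e", "f"]
    print(last_n_within_budget(history, budget_tokens=3))
```

Passing a real tokenizer through `count_tokens` keeps the window logic separate from tokenization, so the same selection code works once the counter is swapped for the pinned one.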