
Metrics (M1–M8)

The suite reports eight metrics. All eight share one reporting framework (per-workload scores with bootstrap confidence intervals), and each metric targets a different axis. The authoritative formulas are in docs/03-eval-metrics.md.
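As an illustration of the shared reporting framework, here is a minimal sketch of a percentile bootstrap confidence interval over per-workload scores. The function name `bootstrap_ci`, the choice of the mean as the aggregated statistic, the percentile method, and the defaults (2000 resamples, 95% interval) are all assumptions made for this example; the suite's actual procedure is whatever docs/03-eval-metrics.md specifies.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, ci_low, ci_high) for a list of per-workload scores.

    Hypothetical helper: percentile bootstrap of the mean. The statistic,
    resample count, and interval width are illustrative assumptions.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # Read the interval bounds off the sorted resample means.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return statistics.fmean(scores), means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    mean, low, high = bootstrap_ci([0.80, 0.90, 0.70, 0.85])
    print(f"mean={mean:.3f} ci=[{low:.3f}, {high:.3f}]")
```

Seeding the generator keeps the interval reproducible between runs, which matters when reported CIs are compared across stores.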

| Metric | Measures | Notes |
| --- | --- | --- |
| M1 | Recall under budget | last-N turns, N derived from budget_tokens; the recall baseline. |
| M2 | (see metrics doc) | |
| M3 | Token-budget efficiency | tokenizer pinned to cl100k_base for reproducibility. |
| M4 | (see metrics doc) | |
| M5 | (see metrics doc) | |
| M6 | Temporal consistency | as_of-aware minus as_of-naive recall; positive = improvement. |
| M7 | (see metrics doc) | |
| M8 | Conformance (conformance_rate) | fraction of applicable conformance assertions that pass; reference store = 1.000. |
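The notes for M1, M6, and M8 describe simple computations, which can be sketched as follows. This is a rough illustration, not the suite's implementation: the function names, the greedy newest-first rule for deriving N from budget_tokens, the whitespace token counter standing in for a real tokenizer, and the use of `None` to mark a non-applicable assertion are all assumptions. The formulas in docs/03-eval-metrics.md take precedence wherever they differ.

```python
def last_n_turns(turns, budget_tokens, count_tokens):
    """M1-style baseline: keep the most recent turns that fit the budget.

    Assumption: N is found greedily, walking back from the newest turn
    until the next turn would exceed budget_tokens.
    """
    kept = []
    used = 0
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget_tokens:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def temporal_consistency(recall_as_of_aware, recall_as_of_naive):
    """M6-style delta: as_of-aware minus as_of-naive recall (positive = improvement)."""
    return recall_as_of_aware - recall_as_of_naive


def conformance_rate(results):
    """M8-style rate: passing share of applicable assertions.

    Assumption: each result is True (pass), False (fail), or None (not applicable).
    """
    applicable = [r for r in results if r is not None]
    if not applicable:
        raise ValueError("no applicable conformance assertions")
    return sum(applicable) / len(applicable)


if __name__ == "__main__":
    # Stand-in token counter; a real run would use the pinned tokenizer.
    whitespace_count = lambda text: len(text.split())
    print(last_n_turns(["a b c", "d e", "f"], 3, whitespace_count))
    print(temporal_consistency(0.75, 0.5))
    print(conformance_rate([True, True, None, False, True]))
```

Passing the token counter in as a parameter keeps the sketch independent of any tokenizer library.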
Note: The M2, M4, M5, and M7 definitions are intentionally not reproduced here to avoid drift. Pull them from docs/03-eval-metrics.md, which is canonical.