How Conformance Works

The suite is empirical, not a proof. For each capability it:

  1. Creates a fresh store per seed (isolation between runs).
  2. Writes test data drawn from the locked corpus, issues queries, and checks results against an oracle that derives ground truth.
  3. Scores both directions: false positives (e.g. leakage) and false negatives (e.g. over-restriction).
  4. Computes a bootstrap confidence interval per direction.
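Steps 1 to 3 can be sketched for a single seed as follows. This is an illustration only, not the code in src/aml/eval/conformance.py: the `CleanStore` and `LeakyStore` classes, the `CORPUS` list, the owner-scoped query, and the `run_seed` function are all hypothetical names and behaviours invented for the example.

```python
import random

# Hypothetical stand-in for the locked corpus: (owner, text) pairs with unique texts.
CORPUS = [(f"user{i % 3}", f"doc{i}") for i in range(30)]


class CleanStore:
    """Hypothetical backend that scopes queries to the owner."""

    def __init__(self):
        self._rows = []

    def write(self, owner, text):
        self._rows.append((owner, text))

    def query(self, owner):
        return [t for o, t in self._rows if o == owner]


class LeakyStore(CleanStore):
    """Hypothetical faulty backend that ignores the owner filter."""

    def query(self, owner):
        return [t for _, t in self._rows]


def run_seed(seed, make_store, corpus=CORPUS, n_items=12):
    """Score one seed; returns (false_positive_rate, false_negative_rate)."""
    rng = random.Random(seed)
    store = make_store()                      # step 1: fresh store per seed
    items = rng.sample(corpus, n_items)       # step 2: data drawn from the corpus
    for owner, text in items:
        store.write(owner, text)

    all_texts = {t for _, t in items}
    fp = fp_possible = fn = fn_possible = 0
    for owner in sorted({o for o, _ in items}):
        expected = {t for o, t in items if o == owner}   # oracle ground truth
        got = set(store.query(owner))
        fp += len(got - expected)             # step 3a: leakage
        fp_possible += len(all_texts - expected)
        fn += len(expected - got)             # step 3b: over-restriction
        fn_possible += len(expected)
    return fp / max(fp_possible, 1), fn / max(fn_possible, 1)
```

Calling `run_seed(0, CleanStore)` scores the clean backend and `run_seed(0, LeakyStore)` the faulty one. Collecting the two rates over many seeds gives the per-direction samples that step 4 bootstraps.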

A direction PASSES iff its CI excludes the failing outcome (GMP §8.2). When per-seed outcomes are deterministic and identical across seeds, as on a clean backend, every bootstrap resample has the same mean, so the CI is degenerate (lo = hi = mean) and the backend passes unambiguously.
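The pass rule and the degenerate case can be illustrated with a percentile bootstrap over per-seed error rates. This is a minimal sketch under stated assumptions, not the code in src/aml/eval/metrics.py. In particular, it assumes the "failing outcome" is an error rate above a tolerance `max_error`, so a direction passes when the CI upper bound does not exceed that tolerance; GMP §8.2 is the authoritative definition. The names `bootstrap_ci`, `direction_passes`, and `max_error` are hypothetical.

```python
import random
from statistics import fmean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-seed error rates."""
    rng = random.Random(seed)
    n = len(values)
    # Resample seeds with replacement and record the mean of each resample.
    means = sorted(fmean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def direction_passes(error_rates, max_error=0.0):
    """Assumed rule: pass iff the whole CI lies at or below max_error."""
    _, hi = bootstrap_ci(error_rates)
    return hi <= max_error


clean = [0.0] * 10        # every seed reports zero errors
mixed = [0.0, 1.0] * 5    # outcomes differ across seeds

print(bootstrap_ci(clean), direction_passes(clean))
print(bootstrap_ci(mixed), direction_passes(mixed))
```

With identical per-seed values every resample has the same mean, so `lo` and `hi` coincide. When outcomes differ across seeds the interval has nonzero width, and under the assumed rule the direction fails as soon as the upper bound rises above the tolerance.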

Source: src/aml/eval/conformance.py, src/aml/eval/metrics.py