Let $f: \mathcal{X} \to \mathcal{Y}$ be the trained model and $D$ a finite test set. A slice $S \subseteq \mathcal{X} \times \mathcal{Y}$ induces the hypothesis test
$$H_0: \mu_S = \mu_D \quad \text{vs.} \quad H_1: \mu_S \neq \mu_D,$$
where $\mu_S$ and $\mu_D$ denote the mean loss of $f$ on the slice and on the full test set, respectively.
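Concretely, once per-example losses are in hand, the slice test is a two-sample comparison. A minimal sketch, assuming per-example loss arrays `losses_S` (slice) and `losses_D` (full test set) have already been computed; the function name and effect-size choice are illustrative, not prescribed by the source:

import numpy as np
from scipy import stats

def slice_test(losses_S: np.ndarray, losses_D: np.ndarray):
    # Welch's t-test: H0 says the slice's mean loss equals the overall mean loss.
    t, p = stats.ttest_ind(losses_S, losses_D, equal_var=False)
    # Standardized mean difference as a rough effect size (a Cohen's d variant).
    pooled = np.sqrt((losses_S.var(ddof=1) + losses_D.var(ddof=1)) / 2)
    effect = (losses_S.mean() - losses_D.mean()) / pooled
    return p, effect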
With many candidate slices $\{S_j\}_{j=1}^{m}$, controlling the family-wise error rate at level $\alpha$ requires testing each slice at $\alpha' = \alpha/m$ (Bonferroni). For data-only methods this is brutal: $m$ grows combinatorially with the number of feature combinations, so the per-test threshold collapses toward zero.
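To put a number on the blowup, a back-of-the-envelope count (the figures are illustrative, not from the source): even a modest schema with 20 binary features, sliced on up to three features at a time, yields nearly ten thousand tests.

from math import comb

alpha, n_features = 0.05, 20  # illustrative
# one slice per value assignment of every 1-, 2-, or 3-feature subset
m = sum(comb(n_features, k) * 2**k for k in range(1, 4))
print(m, alpha / m)  # 9920 slices -> per-test alpha ≈ 5.0e-6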
Context-aware testing (CAT) instead supplies a sampler $\pi: \mathcal{C} \times \mathcal{D} \times \mathbb{N} \to 2^{\mathcal{X} \times \mathcal{Y}}$ that uses context as an inductive bias. SMART instantiates $\pi$ with an LLM:
- generate · $(H_i, J_i) \sim \ell(C_E, C_D)$: hypothesis–justification pairs sampled from the LLM $\ell$ conditioned on external context $C_E$ and a data summary $C_D$
- operationalize · compile $H_i$ into a boolean predicate $g_i: \mathcal{X} \to \{0,1\}$ (see the sketch after this list)
- self-falsify · test $H_0$ for the slice $S_i = \{x : g_i(x) = 1\}$
- report · rank by adjusted p-value & effect size
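To make the operationalize step concrete: for a hypothetical hypothesis $H_i$ = "the model underperforms on short reviews containing negation", the compiled predicate might look like the following (the column name and threshold are invented for illustration):

def g_i(row) -> bool:
    # Hypothetical predicate compiled from H_i; "review_text" is an assumed column.
    text = row["review_text"].lower()
    return len(text.split()) < 20 and any(neg in text for neg in ("not", "never", "n't"))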
The promise is twofold: fewer tests mean a looser correction (and therefore more power), and the tests themselves are ones you'd actually want to write down.
class SMART:
    def __call__(self, model_f, data, ctx):
        # 1 · generate hypotheses from external context + a data summary
        H = self.llm.generate(ctx_external=ctx, ctx_data=summarize(data))
        H = self.llm.refine(H, n=self.budget)  # keep top-m
        results = []
        for H_i, J_i in H:
            # 2 · operationalize: NL hypothesis -> boolean predicate
            g_i = self.llm.operationalize(H_i, schema=data.schema)
            S_i = data[g_i(data)]
            if len(S_i) < self.min_n:
                continue  # slice too small to test
            # 3 · self-falsify: Welch's t-test needs per-example losses, not means
            losses_S, losses_D = loss(model_f, S_i), loss(model_f, data)
            p, eff = welch_t(losses_S, losses_D)
            results.append(Test(H_i, J_i, p, eff))
        # 4 · multiplicity correction over the m generated tests + report
        return bonferroni(results, alpha=self.alpha)
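A usage sketch, with the constructor arguments and the fields of `Test` assumed rather than specified by the source:

smart = SMART()  # assume __init__ wires up llm, budget, min_n, alpha
report = smart(model_f, data, ctx="e-commerce reviews; labels are 1-5 star ratings")
for test in report:
    print(test.H_i, test.p, test.eff)  # field names hypothetical

Note that the final correction divides $\alpha$ by the handful of generated hypotheses rather than the combinatorial $m$ of exhaustive slicing, which is exactly where the claimed power gain comes from.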