Let $f: \mathcal{X} \to \mathcal{Y}$ be the trained model and $D$ a finite test set. A slice $S \subseteq \mathcal{X} \times \mathcal{Y}$ induces the hypothesis test
$$H_0: \mu_S = \mu_D \quad \text{vs.} \quad H_1: \mu_S \neq \mu_D,$$
where $\mu_S$ and $\mu_D$ denote the mean loss of $f$ on the slice and on the full test set, respectively.
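Concretely, once per-example losses are in hand, the slice test is a two-sample comparison. A minimal sketch, assuming per-example loss arrays `losses_S` (slice) and `losses_D` (full test set) have already been computed; the function name and effect-size choice are illustrative, not prescribed by the source:

import numpy as np
from scipy import stats

def slice_test(losses_S: np.ndarray, losses_D: np.ndarray):
    # Welch's t-test: H0 says the slice's mean loss equals the overall mean loss.
    t, p = stats.ttest_ind(losses_S, losses_D, equal_var=False)
    # Standardized mean difference as a rough effect size (a Cohen's d variant).
    pooled = np.sqrt((losses_S.var(ddof=1) + losses_D.var(ddof=1)) / 2)
    effect = (losses_S.mean() - losses_D.mean()) / pooled
    return p, effect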
With many candidate slices $\{S_j\}_{j=1}^{m}$, controlling the family-wise error rate at level $\alpha$ requires testing each slice at $\alpha' = \alpha/m$ (Bonferroni). For data-only methods this is brutal: $m$ grows combinatorially with the number of feature combinations, so the per-test threshold collapses toward zero.
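To put a number on the blowup, a back-of-the-envelope count (the figures are illustrative, not from the source): even a modest schema with 20 binary features, sliced on up to three features at a time, yields nearly ten thousand tests.

from math import comb

alpha, n_features = 0.05, 20  # illustrative
# one slice per value assignment of every 1-, 2-, or 3-feature subset
m = sum(comb(n_features, k) * 2**k for k in range(1, 4))
print(m, alpha / m)  # 9920 slices -> per-test alpha ≈ 5.0e-6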
Context-aware testing (CAT) instead supplies a sampler $\pi: \mathcal{C} \times \mathcal{D} \times \mathbb{N} \to 2^{\mathcal{X} \times \mathcal{Y}}$ that uses context as an inductive bias. SMART instantiates $\pi$ with an LLM:
- generate · $(H_i, J_i) \sim \ell(C_E, C_D)$: hypothesis–justification pairs sampled from the LLM $\ell$ conditioned on external context $C_E$ and a data summary $C_D$
- operationalize · compile $H_i$ into a boolean predicate $g_i: \mathcal{X} \to \{0,1\}$ (see the sketch after this list)
- self-falsify · test $H_0$ for the slice $S_i = \{x : g_i(x) = 1\}$
- report · rank by adjusted p-value & effect size
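To make the operationalize step concrete: for a hypothetical hypothesis $H_i$ = "the model underperforms on short reviews containing negation", the compiled predicate might look like the following (the column name and threshold are invented for illustration):

def g_i(row) -> bool:
    # Hypothetical predicate compiled from H_i; "review_text" is an assumed column.
    text = row["review_text"].lower()
    return len(text.split()) < 20 and any(neg in text for neg in ("not", "never", "n't"))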
The promise is twofold: fewer tests mean a looser correction (and therefore more power), and the tests themselves are ones you'd actually want to write down.
class SMART:
    def __call__(self, model_f, data, ctx):
        # 1 · generate hypotheses from external context + a data summary
        H = self.llm.generate(ctx_external=ctx, ctx_data=summarize(data))
        H = self.llm.refine(H, n=self.budget)  # keep top-m
        results = []
        for H_i, J_i in H:
            # 2 · operationalize: NL hypothesis -> boolean predicate
            g_i = self.llm.operationalize(H_i, schema=data.schema)
            S_i = data[g_i(data)]
            if len(S_i) < self.min_n:
                continue  # slice too small to test
            # 3 · self-falsify: Welch's t-test needs per-example losses, not means
            losses_S, losses_D = loss(model_f, S_i), loss(model_f, data)
            p, eff = welch_t(losses_S, losses_D)
            results.append(Test(H_i, J_i, p, eff))
        # 4 · multiplicity correction over the m generated tests + report
        return bonferroni(results, alpha=self.alpha)
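A usage sketch, with the constructor arguments and the fields of `Test` assumed rather than specified by the source:

smart = SMART()  # assume __init__ wires up llm, budget, min_n, alpha
report = smart(model_f, data, ctx="e-commerce reviews; labels are 1-5 star ratings")
for test in report:
    print(test.H_i, test.p, test.eff)  # field names hypothetical

Note that the final correction divides $\alpha$ by the handful of generated hypotheses rather than the combinatorial $m$ of exhaustive slicing, which is exactly where the claimed power gain comes from.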