
NeurIPS 2024 · interactive exposition · paper #03

Context-Aware
Model Testing

Paulius Rauba · Nabeel Seedat · Max Ruiz Luyten · Mihaela van der Schaar · University of Cambridge

The de facto way of testing ML models is to grep through the data for under-performing slices. But every slice is a hypothesis test — and testing many hypotheses on a finite dataset is a recipe for false positives. SMART instead lets a language model read the context of the task and propose only the failure modes that are relevant, then refutes them with the data.

[interactive demo · controls: test budget m = 12/60, significance level α = 0.05, free-form deployment context · readout: TPR and FPR for data-only vs. SMART]

panel 01

Two ways of asking "where does the model fail?"

Both methods get the same budget of tests on the same dataset. Data-only ranks all subgroups by raw divergence. SMART instead samples hypotheses from an LLM conditioned on the deployment context. Bonferroni correction is applied to both; try sliding m up.

data-only · ∀ subgroups

12 tests · α' = α/m ≈ 0.0042

SMART · LLM-proposed

12 tests · α' = α/m ≈ 0.0042

panel 02

Multiple testing erodes data-only power

As m grows, exhaustive data-only search has to correct ever more aggressively to keep the family-wise false-positive rate at α, and statistical power collapses. Context-guided sampling keeps m small and relevant.

[chart · true positive rate (power) and false positive rate (type-I error) as the test budget m grows]
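The power collapse can be reproduced with a few lines of standard-library Python: an analytic power calculation for a two-sided z-test (a simplification of the Welch test SMART runs) evaluated at the Bonferroni-corrected level α/m. The effect size and sample size below are illustrative choices, not values from the paper.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(p):
    """Inverse standard normal CDF by bisection; plenty accurate for a sketch."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ztest_power(effect, n, alpha):
    """Power of a two-sided z-test for a mean shift of `effect` SDs with n samples."""
    z_crit = norm_ppf(1.0 - alpha / 2.0)
    return 1.0 - norm_cdf(z_crit - effect * math.sqrt(n))

alpha, effect, n = 0.05, 0.3, 100
for m in (1, 12, 144, 1728):
    print(f"m={m:4d}  alpha'={alpha/m:.2e}  power={ztest_power(effect, n, alpha/m):.3f}")
```

Power is monotone decreasing in m: each extra candidate test makes the corrected threshold stricter for every test in the family.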

panel 03

What the LLM is reading · context to hypothesis

SMART's hypothesis generator ingests the deployment context (free-form prompt) and dataset context (feature names, dtypes, sample stats) and emits a ranked stream of plausible failure modes with justifications.

external context · C_E

dataset context · C_D

→ generated hypotheses · (H_i, J_i) ∼ ℓ(C_E, C_D)
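The paper does not fix a concrete schema for the generated (H_i, J_i) pairs; a plausible sketch of what one such record carries is below. The `Hypothesis` class, its field names, and the hospital-readmission example are illustrative assumptions, not the paper's format.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    statement: str      # H_i · natural-language failure mode
    justification: str  # J_i · why the context makes it plausible
    predicate: str      # later operationalized into a boolean slice filter

# hypothetical example for a hospital-readmission model (not from the paper)
h = Hypothesis(
    statement="Error is elevated for patients over 80 on five or more medications",
    justification="Deployment context mentions geriatric wards; sample stats show few such rows",
    predicate="(age > 80) & (n_medications >= 5)",
)
print(h.statement)
```

Carrying the justification J_i alongside the statement is what makes the final report readable: each flagged slice arrives with the reason it was proposed.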

section 04

How it actually works

A dataset slice is a hypothesis. Many slices mean many hypotheses, and that demands a multiplicity correction.

Let f be the trained model and D a finite test set. A slice S ⊆ D induces the hypothesis test

H₀ : μ_S = μ_D  vs.  H₁ : μ_S ≠ μ_D,

where μ_S and μ_D are the mean losses of f on S and on all of D.

With m candidate slices, controlling the family-wise error at level α requires testing each at α' = α/m (Bonferroni). For data-only methods this is brutal: m scales with the number of feature-value combinations.
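To see why m explodes for exhaustive data-only search, count the conjunctive slices over a small hypothetical schema (the feature names and cardinalities here are made up): fixing one value for each feature in a non-empty subset yields, per subset, the product of the chosen features' cardinalities.

```python
from itertools import combinations
from math import prod

# hypothetical tabular schema: feature name → number of distinct values
cardinality = {"sex": 2, "age_band": 4, "region": 5}

# data-only search: one slice per assignment of values to a non-empty feature subset
m_data_only = sum(
    prod(cardinality[f] for f in subset)
    for r in range(1, len(cardinality) + 1)
    for subset in combinations(cardinality, r)
)

alpha, m_smart = 0.05, 12
print(m_data_only, alpha / m_data_only, alpha / m_smart)
```

Even three modest features already yield 89 slices, so the data-only threshold α/89 is roughly seven times stricter than SMART's α/12; real tabular schemas make the gap far larger.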

Context-aware testing (CAT) instead supplies a hypothesis sampler ℓ(C_E, C_D) that uses context as an inductive bias. SMART instantiates ℓ with an LLM:

• generate · sample (H_i, J_i) ∼ ℓ(C_E, C_D)
• operationalize · compile each natural-language H_i into a boolean predicate g_i over the features
• self-falsify · run a frequentist test for slice S_i = {x ∈ D : g_i(x)}
• report · rank surviving hypotheses by adjusted p-value & effect size

The promise is twofold: fewer tests means looser correction (better power), and the tests themselves are ones you'd actually want to write down.

class SMART:
    def __call__(self, model_f, data, ctx):
        # 1 · generate hypotheses from deployment + dataset context
        H = self.llm.generate(ctx_external=ctx, ctx_data=summarize(data))
        H = self.llm.refine(H, n=self.budget)        # keep top-m

        results = []
        for H_i, J_i in H:
            # 2 · operationalize: NL hypothesis → boolean predicate
            g_i = self.llm.operationalize(H_i, schema=data.schema)
            S_i = data[g_i(data)]
            if len(S_i) < self.min_n:
                continue

            # 3 · self-falsify: compare per-example losses with a frequentist test
            losses_S = loss(model_f, S_i)    # per-example losses on the slice
            losses_D = loss(model_f, data)   # per-example losses on the full set
            p, eff = welch_t(losses_S, losses_D)
            results.append(Test(H_i, J_i, p, eff))

        # 4 · multiplicity correction + report
        return bonferroni(results, alpha=self.alpha)
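The `welch_t` and `bonferroni` helpers are left undefined in the sketch above. A standard-library stand-in might look like the following, with the p-value taken from a normal approximation to the t distribution (adequate for large slices) and the t statistic returned in place of a dedicated effect-size measure; here `results` are plain `(p_value, payload)` pairs rather than the `Test` records.

```python
import math
from statistics import mean, variance

def welch_t(xs, ys):
    """Welch's t statistic; two-sided p via a normal approximation (fine for large n)."""
    vx, vy = variance(xs), variance(ys)
    se = math.sqrt(vx / len(xs) + vy / len(ys))
    t = (mean(xs) - mean(ys)) / se
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0))))
    return p, t

def bonferroni(results, alpha):
    """Keep (p_value, payload) pairs that clear the corrected threshold α/m."""
    m = len(results)
    return sorted((r for r in results if r[0] <= alpha / m), key=lambda r: r[0])
```

In practice one would reach for `scipy.stats.ttest_ind(..., equal_var=False)` for the exact t-distribution p-value; the stdlib version above just keeps the sketch dependency-free.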

section 05

Cite

@inproceedings{rauba2024contextaware,
  title     = {Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models},
  author    = {Paulius Rauba and Nabeel Seedat and Max Ruiz Luyten and Mihaela van der Schaar},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2024},
  url       = {https://openreview.net/forum?id=d75qCZb7TX}
}