
ICML 2025 · interactive exposition · paper #04

Auditing LLM Robustness with
Distribution-Based Perturbation Analysis

Paulius Rauba · Qiyao Wei · Mihaela van der Schaar · University of Cambridge

Asking an LLM the same question twice gives different answers — so a single output diff tells us nothing about whether a perturbation really changed the model's behaviour. We instead sample outputs under the original prompt and under the perturbed one, embed each in a low-dimensional semantic space, and run a frequentist permutation test. The result is a p-value with no distributional assumptions.

[interactive demo · controls: samples per arm N (40, up to 200), perturbation picker, keyboard: space = stream, r = resample · settings: α = 0.05, permutations B = 800 · readouts: test statistic T = ‖x̄₀ − x̄₁‖², p-value, decision (REJECT / FAIL TO REJECT H₀)]

panel 01

Two prompts · the intervention

Pick an intervention applied to a base prompt x₀. Some interventions should change the answer; some shouldn't. The audit makes this distinction quantitative.

[prompt display · x₀ original · x₁ perturbed]

panel 02

Output cloud in semantic space

Each draw y ∼ f(·|x₀) and y′ ∼ f(·|x₁) is embedded to a low-dim semantic vector φ(y). The hypothesis test asks: are these two clouds drawn from the same distribution?

[scatter plot · y ∼ f(·|x₀), y′ ∼ f(·|x₁), centroids x̄₀, x̄₁ · permutation null from B = 800 random label shuffles · readouts: observed T, p-value, effect ‖μ̂₀ − μ̂₁‖]
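The embedding φ can be any map from raw text to ℝᵏ. A minimal sketch, with a toy bag-of-words projection standing in for the learned sentence encoder (`VOCAB` and `phi` here are illustrative assumptions, not the paper's actual embedding):

```python
import numpy as np

# Toy stand-in for the semantic embedding φ. The real φ is a learned
# sentence encoder; this bag-of-words count vector is illustration only.
VOCAB = ["yes", "no", "maybe", "paris", "london"]

def phi(text):
    # map a response string to a k-dim vector of vocabulary counts
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

z0 = phi("Paris yes")
z1 = phi("No no maybe")
```

Any φ with the same signature (string in, fixed-length vector out) plugs into the test unchanged.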

panel 03

Multiple perturbations · controlled error rates

Auditing usually means running many perturbations on the same model. With m tests, we adjust α to α/m (Bonferroni). Below: each perturbation, its raw p, its adjusted p, and the audit decision.

perturbation | true effect | observed T | p | p · Bonferroni | decision (α = 0.05)
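The adjustment itself is one line. A minimal sketch; the raw p-values are made up for illustration:

```python
def bonferroni(p_values):
    # scale each raw p by the number of tests m, capped at 1
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

raw = [0.001, 0.030, 0.200]          # illustrative raw p-values
adj = bonferroni(raw)
reject = [p <= 0.05 for p in adj]    # audit decisions at α = 0.05
```

Bonferroni is conservative; any family-wise-error correction (e.g. Holm) can be substituted without touching the per-test machinery.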

section 04

How it actually works

Distribution-based perturbation analysis (DBPA) is a frequentist two-sample test on a low-dimensional semantic projection.

Let f(·|x) denote the LLM's stochastic response distribution given prompt x, and φ : 𝒴 → ℝᵏ an embedding into a low-dimensional semantic space. We test the null hypothesis

H₀ : φ(Y) =ᵈ φ(Y′), where Y ∼ f(·|x₀) and Y′ ∼ f(·|x₁).

Drawing N i.i.d. samples from each side gives empirical centroids x̄₀ and x̄₁ in the embedded space. The test statistic is the squared centroid distance

T = ‖x̄₀ − x̄₁‖².

Under H₀, the group labels are exchangeable, so we approximate the null by permuting them B times and recomputing T. The Monte-Carlo p-value is then

p = (1 + #{b : T⁽ᵇ⁾ ≥ T_obs}) / (B + 1).
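As a numeric sanity check of the p-value formula (the null statistics below are toy values, assumed for illustration):

```python
import numpy as np

# With B = 799 toy null statistics 1..799 and observed statistic 760,
# exactly 40 null draws are >= the observed value, so p = 41/800.
B = 799
T_null = np.arange(1, B + 1, dtype=float)
T_obs = 760.0
p = (1 + (T_null >= T_obs).sum()) / (B + 1)
```

The +1 in numerator and denominator keeps p strictly positive and makes the permutation test exact at level α.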

The framework is model-agnostic: it treats f as a black box and supports arbitrary perturbations mapping x₀ ↦ x₁. With m simultaneous perturbations, control of the family-wise error rate follows from any standard correction (e.g. Bonferroni α/m).

import numpy as np
from dataclasses import dataclass

rng = np.random.default_rng()

@dataclass
class DBPAResult:
    T: float
    p: float
    eff: float

def dbpa(f, x0, x1, phi, N=40, B=800):
    # 1 · MC sample N outputs from each arm
    Y0 = [f(x0) for _ in range(N)]
    Y1 = [f(x1) for _ in range(N)]

    # 2 · embed into low-dim semantic space
    Z0 = np.stack([phi(y) for y in Y0])
    Z1 = np.stack([phi(y) for y in Y1])

    # 3 · observed test stat: squared centroid distance
    T_obs = np.sum((Z0.mean(0) - Z1.mean(0)) ** 2)

    # 4 · permutation null: shuffle group labels, recompute T
    Z = np.concatenate([Z0, Z1])
    T_null = np.empty(B)
    for b in range(B):
        idx = rng.permutation(2 * N)
        grp0, grp1 = Z[idx[:N]], Z[idx[N:]]
        T_null[b] = np.sum((grp0.mean(0) - grp1.mean(0)) ** 2)

    # Monte-Carlo p-value with +1 correction (keeps p > 0, test exact)
    p = (1 + (T_null >= T_obs).sum()) / (B + 1)
    eff = np.linalg.norm(Z0.mean(0) - Z1.mean(0))
    return DBPAResult(T=float(T_obs), p=float(p), eff=float(eff))
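A quick end-to-end check of the permutation machinery on synthetic clouds. `perm_p` restates the permutation core of `dbpa` so the snippet runs on its own, and the Gaussian "embeddings" are stand-ins for real model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_p(Z0, Z1, B=800):
    # permutation two-sample test on squared centroid distance
    N = len(Z0)
    T_obs = np.sum((Z0.mean(0) - Z1.mean(0)) ** 2)
    Z = np.concatenate([Z0, Z1])
    T_null = np.empty(B)
    for b in range(B):
        idx = rng.permutation(2 * N)
        T_null[b] = np.sum((Z[idx[:N]].mean(0) - Z[idx[N:]].mean(0)) ** 2)
    return (1 + (T_null >= T_obs).sum()) / (B + 1)

# same distribution -> p should be unremarkable; mean-shifted -> p tiny
Z_a = rng.normal(size=(40, 3))
Z_b = rng.normal(size=(40, 3))
Z_shift = rng.normal(size=(40, 3)) + 1.0

p_null = perm_p(Z_a, Z_b)
p_alt = perm_p(Z_a, Z_shift)
```

With a unit mean shift in three dimensions and N = 40 per arm, the observed statistic dwarfs every permuted one, so `p_alt` lands near the minimum attainable value 1/(B+1).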

section 05

Cite

@inproceedings{rauba2025statistical,
  title     = {Statistical Hypothesis Testing for Auditing Robustness in Language Models},
  author    = {Paulius Rauba and Qiyao Wei and Mihaela van der Schaar},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  series    = {PMLR},
  volume    = {267},
  pages     = {51297--51313},
  year      = {2025},
  url       = {https://openreview.net/forum?id=ECayXPDoha}
}