
ICML 2025 · interactive exposition · paper #04

Auditing LLM Robustness with
Distribution-Based Perturbation Analysis

Paulius Rauba · Qiyao Wei · Mihaela van der Schaar · University of Cambridge

Asking an LLM the same question twice gives different answers — so a single output diff tells us nothing about whether a perturbation really changed the model's behaviour. We instead sample outputs under the original prompt and under the perturbed one, embed each in a low-dimensional semantic space, and run a frequentist permutation test. The result is a p-value with no distributional assumptions.

[interactive demo · controls: samples per arm N (40, up to 200), perturbation picker, keyboard: space = stream, r = resample · settings: α = 0.05, permutations B = 800 · readouts: test statistic T = ‖x̄₀ − x̄₁‖², p-value, decision (REJECT / FAIL TO REJECT H₀)]

panel 01

Two prompts · the intervention

Pick an intervention applied to a base prompt x₀. Some interventions should change the answer; some shouldn't. The audit makes this distinction quantitative.

[prompt display · x₀ original · x₁ perturbed]

panel 02

Output cloud in semantic space

Each draw y ∼ f(·|x₀) and y′ ∼ f(·|x₁) is embedded to a low-dim semantic vector φ(y). The hypothesis test asks: are these two clouds drawn from the same distribution?

[scatter plot · y ∼ f(·|x₀), y′ ∼ f(·|x₁), centroids x̄₀, x̄₁ · permutation null from B = 800 random label shuffles · readouts: observed T, p-value, effect ‖μ̂₀ − μ̂₁‖]
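The embedding φ can be any map from raw text to ℝᵏ. A minimal sketch, with a toy bag-of-words projection standing in for the learned sentence encoder (`VOCAB` and `phi` here are illustrative assumptions, not the paper's actual embedding):

```python
import numpy as np

# Toy stand-in for the semantic embedding φ. The real φ is a learned
# sentence encoder; this bag-of-words count vector is illustration only.
VOCAB = ["yes", "no", "maybe", "paris", "london"]

def phi(text):
    # map a response string to a k-dim vector of vocabulary counts
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

z0 = phi("Paris yes")
z1 = phi("No no maybe")
```

Any φ with the same signature (string in, fixed-length vector out) plugs into the test unchanged.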

panel 03

Multiple perturbations · controlled error rates

Auditing usually means running many perturbations on the same model. With m tests, we adjust α to α/m (Bonferroni). Below: each perturbation, its raw p, its adjusted p, and the audit decision.

perturbation | true effect | observed T | p | p · Bonferroni | decision (α = 0.05)
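The adjustment itself is one line. A minimal sketch; the raw p-values are made up for illustration:

```python
def bonferroni(p_values):
    # scale each raw p by the number of tests m, capped at 1
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

raw = [0.001, 0.030, 0.200]          # illustrative raw p-values
adj = bonferroni(raw)
reject = [p <= 0.05 for p in adj]    # audit decisions at α = 0.05
```

Bonferroni is conservative; any family-wise-error correction (e.g. Holm) can be substituted without touching the per-test machinery.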

section 04

How it actually works

Distribution-based perturbation analysis (DBPA) is a frequentist two-sample test on a low-dimensional semantic projection.

Let f(·|x) denote the LLM's stochastic response distribution given prompt x, and φ : 𝒴 → ℝᵏ an embedding into a low-dimensional semantic space. We test the null hypothesis

H₀ : φ(Y) =ᵈ φ(Y′), where Y ∼ f(·|x₀) and Y′ ∼ f(·|x₁).

Drawing N i.i.d. samples from each side gives empirical centroids x̄₀ and x̄₁ in the embedded space. The test statistic is the squared centroid distance

T = ‖x̄₀ − x̄₁‖².

Under H₀, the group labels are exchangeable, so we approximate the null by permuting them B times and recomputing T. The Monte-Carlo p-value is then

p = (1 + #{b : T⁽ᵇ⁾ ≥ T_obs}) / (B + 1).
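As a numeric sanity check of the p-value formula (the null statistics below are toy values, assumed for illustration):

```python
import numpy as np

# With B = 799 toy null statistics 1..799 and observed statistic 760,
# exactly 40 null draws are >= the observed value, so p = 41/800.
B = 799
T_null = np.arange(1, B + 1, dtype=float)
T_obs = 760.0
p = (1 + (T_null >= T_obs).sum()) / (B + 1)
```

The +1 in numerator and denominator keeps p strictly positive and makes the permutation test exact at level α.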

The framework is model-agnostic: it treats f as a black box and supports arbitrary perturbations mapping x₀ ↦ x₁. With m simultaneous perturbations, control of the family-wise error rate follows from any standard correction (e.g. Bonferroni α/m).

import numpy as np
from dataclasses import dataclass

rng = np.random.default_rng()

@dataclass
class DBPAResult:
    T: float
    p: float
    eff: float

def dbpa(f, x0, x1, phi, N=40, B=800):
    # 1 · MC sample N outputs from each arm
    Y0 = [f(x0) for _ in range(N)]
    Y1 = [f(x1) for _ in range(N)]

    # 2 · embed into low-dim semantic space
    Z0 = np.stack([phi(y) for y in Y0])
    Z1 = np.stack([phi(y) for y in Y1])

    # 3 · observed test stat: squared centroid distance
    T_obs = np.sum((Z0.mean(0) - Z1.mean(0)) ** 2)

    # 4 · permutation null: shuffle group labels, recompute T
    Z = np.concatenate([Z0, Z1])
    T_null = np.empty(B)
    for b in range(B):
        idx = rng.permutation(2 * N)
        grp0, grp1 = Z[idx[:N]], Z[idx[N:]]
        T_null[b] = np.sum((grp0.mean(0) - grp1.mean(0)) ** 2)

    # Monte-Carlo p-value with +1 correction (keeps p > 0, test exact)
    p = (1 + (T_null >= T_obs).sum()) / (B + 1)
    eff = np.linalg.norm(Z0.mean(0) - Z1.mean(0))
    return DBPAResult(T=float(T_obs), p=float(p), eff=float(eff))
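A quick end-to-end check of the permutation machinery on synthetic clouds. `perm_p` restates the permutation core of `dbpa` so the snippet runs on its own, and the Gaussian "embeddings" are stand-ins for real model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_p(Z0, Z1, B=800):
    # permutation two-sample test on squared centroid distance
    N = len(Z0)
    T_obs = np.sum((Z0.mean(0) - Z1.mean(0)) ** 2)
    Z = np.concatenate([Z0, Z1])
    T_null = np.empty(B)
    for b in range(B):
        idx = rng.permutation(2 * N)
        T_null[b] = np.sum((Z[idx[:N]].mean(0) - Z[idx[N:]].mean(0)) ** 2)
    return (1 + (T_null >= T_obs).sum()) / (B + 1)

# same distribution -> p should be unremarkable; mean-shifted -> p tiny
Z_a = rng.normal(size=(40, 3))
Z_b = rng.normal(size=(40, 3))
Z_shift = rng.normal(size=(40, 3)) + 1.0

p_null = perm_p(Z_a, Z_b)
p_alt = perm_p(Z_a, Z_shift)
```

With a unit mean shift in three dimensions and N = 40 per arm, the observed statistic dwarfs every permuted one, so `p_alt` lands near the minimum attainable value 1/(B+1).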

section 05

Cite

@inproceedings{rauba2025statistical,
  title     = {Statistical Hypothesis Testing for Auditing Robustness in Language Models},
  author    = {Paulius Rauba and Qiyao Wei and Mihaela van der Schaar},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  series    = {PMLR},
  volume    = {267},
  pages     = {51297--51313},
  year      = {2025},
  url       = {https://openreview.net/forum?id=ECayXPDoha}
}