FilterPrompt — AI Firewall logo

Repetition Reports for LLMs: Catching Silent Regressions, Drift, and Determinism Loss

Deep Dive · 2024-03-19 · 11 min read · FilterPrompt Team

How to use repeated adversarial scans to detect when your LLM provider silently changes the model under you — a practical guide to delta reports, drift baselines, and CI gates.

The hardest LLM bug to debug is the one you did not ship. Your code did not change. Your prompts did not change. Your tests passed yesterday. This morning, 3% more of your support replies leak the system prompt and your safety classifier is suddenly 8 points worse on the same eval set. Welcome to silent provider-side drift — the reason every serious LLM team now runs repetition reports.

What is a repetition report

A repetition report is the output of running the same vulnerability scan, against the same model, with the same system prompt, on a schedule — and surfacing the diff. Where a one-shot scan tells you 'today, your model fails 12 of 400 probes', a repetition report tells you '+4 new failures since last week, all in the data-exfiltration category, after the provider rolled out a minor model update.'

It is to LLMs what synthetic monitoring is to web apps. You do not wait for users to file tickets — you continuously probe in production-like conditions and alert on regression.

Why drift happens even when you did nothing

  1. Provider model updates — gpt-5 → gpt-5.0.1 silent patch, Claude server-side fine-tune refresh
  2. Routing changes — your 'gpt-5' calls now hit a different inference cluster with different sampling defaults
  3. Safety system updates — provider tightens or loosens output filters; both directions break tests
  4. Tokenizer or context-window changes — same prompt now truncates differently
  5. Non-determinism — temperature > 0, mixture-of-experts routing, batch effects
  6. Your RAG index changed — re-embedding with a new model shifts retrieval, which shifts answers
  7. System-prompt changes from a teammate that 'shouldn't affect anything'

Designing a useful repetition cadence

Daily is overkill for most teams and bankrupts your scan budget. Weekly is the sweet spot for production traffic. Tie an extra ad-hoc scan to these triggers:

  • Provider announces a model update (subscribe to OpenAI / Anthropic / Google changelog feeds)
  • Your CI deploys a system-prompt or tool-schema change
  • RAG ingestion pipeline runs a full re-embed
  • An incident — even a small one — to confirm scope before postmortem

Reading a delta report

A repetition report is only useful if the diff is the headline. The five numbers that matter:

  1. Net new failures — probes that passed before and fail now (the regressions you actually care about)
  2. Net new passes — probes that failed before and pass now (the provider may have patched something for you)
  3. Severity-weighted delta — a new Critical failure outweighs ten new Lows
  4. Category drift — which harm family moved most? A jump in 'jailbreak' suggests safety training changed; a jump in 'PII leakage' suggests output-filter changes
  5. Determinism rate — for probes you ran 3x at temperature 0, how often did the verdict flip? Above 5% means non-determinism is masking real signal

Baselines: the underrated half of the report

A delta is meaningless without a baseline you trust. Establish a baseline by running the same scan three times back-to-back on day one and only treating probes that passed all three as 'reliable green'. Probes that flap on day one will flap forever — exclude them from your alerting set or you will burn out on noise.

Scheduling repeat scans

Inside FilterPrompt you pick a suite, set a cadence (on-demand, nightly, weekly), and the scanner runs the battery automatically against your connected LLM. Each run is stored next to the previous one so you see the delta — net-new Critical / High regressions are highlighted at the top of the report instead of buried in a JSON diff.

Determinism: temperature 0 is not enough

Setting temperature to 0 reduces but does not eliminate non-determinism. Mixture-of-experts routing, server-side batching, and floating-point non-associativity on GPUs all introduce variance. The honest answer is to run every probe N times (3 is usually enough) and report the modal verdict plus the flap rate. A probe that fails 1 of 3 runs is a different signal than one that fails 3 of 3.

What to do when you spot a regression

  1. Confirm by re-running the affected probes 5x — flap rate or real?
  2. Check provider status pages and changelogs for the same window
  3. Diff your own deploys for the same window (system prompt, RAG, tools)
  4. If provider-side: open a support ticket with the exact probe + response, pin to a specific model snapshot if available (e.g. gpt-5-2026-04-15)
  5. If your-side: revert and re-test before debugging in production
  6. Always: log the incident in your AI risk register — auditors will ask

The compounding payoff

Teams that adopt repetition reports stop being surprised. After a quarter of weekly scans you have a graph of robustness over time, you know which harm categories your model+prompt combo is genuinely weak in, and you have a paper trail when a provider update breaks you. That paper trail is also exactly what your EU AI Act and ISO 42001 auditors want — the same artefact does double duty as engineering monitoring and compliance evidence.

Related