Building an AI Vulnerability Scanner: Architecture, Probe Library, and Scoring Math

Architecture · 2025-05-08 · 13 min read · FilterPrompt Team

How we built FilterPrompt's AI vulnerability scanner — the async execution model, BYO-key tenant isolation, the layered judge, and the math behind severity-weighted scores.

This post walks through the actual architecture of FilterPrompt's AI vulnerability scanner — the trade-offs, the data model, and the scoring math. If you are evaluating a scanner, these are the questions to ask the vendor. If you are building one, these are the decisions you cannot avoid.

High-level flow

  1. User picks suites + connected LLM provider in the dashboard
  2. Frontend creates a `scans` row and invokes the `scan-run` edge function
  3. Edge function returns immediately; the run executes asynchronously
  4. For each probe: build the request → call the tenant's LLM → judge the response → write a `scan_results` row
  5. Realtime updates stream back to the dashboard; final report is rendered when status flips to `completed`
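Step 4 is the core of the run. A minimal sketch of that per-probe step follows, assuming the model call, the judge, and the database write are injected as functions. Every name here (`runProbe`, `Deps`, `callModel`, `writeResult`) is hypothetical and chosen for illustration; it is not FilterPrompt's actual code.

```typescript
// Hypothetical shapes; only the build -> call -> judge -> write order comes from the flow above.
interface Probe {
  id: string;
  prompt: string;
}

interface ScanResultRow {
  probeId: string;
  request: string;
  response: string;
  verdict: "pass" | "fail";
  latencyMs: number;
}

interface Deps {
  callModel: (request: string) => Promise<string>; // calls the tenant's LLM
  judge: (probe: Probe, response: string) => "pass" | "fail";
  writeResult: (row: ScanResultRow) => Promise<void>; // inserts a scan_results row
}

async function runProbe(probe: Probe, deps: Deps): Promise<ScanResultRow> {
  const request = probe.prompt; // build the request (assumed to be just the prompt here)
  const startedAt = Date.now();
  const response = await deps.callModel(request);
  const latencyMs = Date.now() - startedAt;
  const verdict = deps.judge(probe, response);
  const row: ScanResultRow = { probeId: probe.id, request, response, verdict, latencyMs };
  await deps.writeResult(row); // persist as soon as the verdict is known
  return row;
}
```

Injecting the dependencies keeps the step testable without a live provider or database.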

Data model

Four tables hold the entire scanner domain:

  • `scan_suites` — versioned categories (Prompt Injection, PII, Jailbreaks, Bias, etc.)
  • `scan_probes` — individual adversarial prompts, each with its own judge recipe
  • `scans` — one row per scan run, with status, tenant, model, started_at, completed_at
  • `scan_results` — one row per probe execution: request, response, verdict, judge reasoning, latency, cost
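As a rough illustration, the four tables could be typed as below. Only the table names and the fields listed above come from this post. The key columns (`id`, `suite_id`, `scan_id`, `probe_id`), the units, and every status value other than `completed` are assumptions.

```typescript
// Only "completed" is named in the post; the other status values are assumed.
type ScanStatus = "queued" | "running" | "completed" | "failed";

interface ScanSuite {
  id: string;
  name: string; // e.g. "Prompt Injection", "PII"
  version: number; // suites are versioned
}

interface ScanProbe {
  id: string;
  suite_id: string;
  prompt: string; // the adversarial prompt
  judge_recipe: unknown; // shape is not described in the post
}

interface Scan {
  id: string;
  tenant_id: string;
  model: string;
  status: ScanStatus;
  started_at: string; // ISO 8601 timestamp (assumed format)
  completed_at: string | null; // null while the run is in flight
}

interface ScanResult {
  id: string;
  scan_id: string;
  probe_id: string;
  request: string;
  response: string;
  verdict: "pass" | "fail";
  judge_reasoning: string;
  latency_ms: number; // unit assumed
  cost: number; // unit assumed to be credits
}

// Small helper showing how the timestamps on a scan row can be used.
function scanDurationMs(scan: Scan): number | null {
  if (scan.completed_at === null) return null;
  return Date.parse(scan.completed_at) - Date.parse(scan.started_at);
}
```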

Row-level security (RLS) policies scope every read and write to the tenant. The `has_role` security-definer function gates admin-only suite editing. Tenant LLM keys live in `tenant_providers`, encrypted with `ENCRYPTION_SECRET` and decrypted only inside the edge function at request time.

Async execution: why edge functions

A full scan can run 500 to 5,000 probes. Doing that synchronously, inside a single request with a 60-second budget, is not viable. Instead, the `scan-run` function kicks off the work and returns a 202 immediately. It then loops through the probes with bounded concurrency (default 5), respects the provider's rate limits, and writes each result as it lands. The dashboard subscribes via Supabase Realtime and renders progress live.
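Bounded concurrency can be sketched with a small worker-lane pool. This is an illustrative helper under stated assumptions, not the production loop: `mapWithConcurrency` is a hypothetical name, and rate-limit handling (for example, backing off on HTTP 429) and per-probe error capture are left out for brevity.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane repeatedly claims the next item until none are left.
  // JavaScript is single-threaded, so `next++` cannot race.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < laneCount; i++) lanes.push(lane());
  await Promise.all(lanes);
  return results;
}
```

With `limit` set to 5, a 1,000-probe scan never has more than five provider calls outstanding. In this sketch a single rejected call rejects the whole run, so a real loop would likely catch errors per probe and record them instead.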

The judging pipeline

The judge is where vulnerability scanners live or die. Ours is a tiered cascade:

  1. Tier 1 (regex / keyword): "I cannot", "I'm sorry", system-prompt fingerprints. Fast, deterministic, free.
  2. Tier 2 (refusal classifier): distinguishes hard refusals from soft refusals that comply anyway.
  3. Tier 3 (contains check): did the response leak the canary the probe planted?
  4. Tier 4 (AI-based detection): evaluates the response against a probe-specific rubric.

Most probes resolve in Tier 1 or 2. Tier 4 is reserved for nuanced cases (bias, hallucination, policy edge cases) because it is the most expensive and the slowest. Every verdict carries the tier that decided it, so reviewers know how much to trust each result.
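The cascade idea (try cheap tiers first, stop at the first one that decides, and record which tier decided) can be sketched as follows. The two example tiers are deliberately simplified stand-ins: the regex, the 200-character cutoff, and all type and function names are assumptions. Tiers 2 and 4 need models and are omitted here.

```typescript
type Outcome = "pass" | "fail";

interface Probe {
  prompt: string;
  canary?: string; // secret planted by the probe, if any
}

interface Verdict {
  outcome: Outcome;
  tier: number; // which tier decided, kept for reviewers
  reasoning: string;
}

interface Tier {
  tier: number;
  // Returns a decision, or null to pass the response down the cascade.
  check: (response: string, probe: Probe) => { outcome: Outcome; reasoning: string } | null;
}

const REFUSAL = /\b(?:I cannot|I can't|I'm sorry)\b/i;

// Toy Tier 1: treat a short response containing a refusal phrase as a pass.
const tier1Regex: Tier = {
  tier: 1,
  check: (response) =>
    REFUSAL.test(response) && response.length < 200
      ? { outcome: "pass", reasoning: "short response matching a refusal phrase" }
      : null,
};

// Toy Tier 3: fail if the planted canary shows up in the response.
const tier3Canary: Tier = {
  tier: 3,
  check: (response, probe) =>
    probe.canary !== undefined && response.includes(probe.canary)
      ? { outcome: "fail", reasoning: "response contains the planted canary" }
      : null,
};

// First decisive tier wins; null means "escalate to the expensive judge".
function runCascade(response: string, probe: Probe, tiers: readonly Tier[]): Verdict | null {
  for (const t of tiers) {
    const decided = t.check(response, probe);
    if (decided !== null) return { ...decided, tier: t.tier };
  }
  return null;
}
```

Note that this toy Tier 1 would wrongly pass a short apology that still leaks the canary. That gap is the kind of soft refusal the Tier 2 classifier exists to catch.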

Scoring math

A raw pass/fail count is misleading: failing one critical PII probe is not equivalent to failing one low-severity tone probe. We compute a severity-weighted score per category, so each failure counts in proportion to its severity.
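One way to write such a score, offered as an assumption rather than FilterPrompt's exact formula: let $w_{\sigma(i)}$ be the weight for the severity of probe $i$, and let $p_i$ be 1 if the probe passed and 0 otherwise. The score for category $c$ is then the weighted share of passed probes.

```latex
S_c = 100 \cdot \frac{\sum_{i \in c} w_{\sigma(i)} \, p_i}{\sum_{i \in c} w_{\sigma(i)}},
\qquad
p_i =
\begin{cases}
1 & \text{if probe } i \text{ passed} \\
0 & \text{otherwise}
\end{cases}
```

As a worked example with hypothetical weights (critical = 10, low = 1): a category with one failed critical probe and nine passed low-severity probes scores $100 \cdot 9 / 19 \approx 47.4$, while its raw pass rate is 90%.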

The overall scan score is the minimum across categories, not the average. One catastrophically failing category should not be hidden by nine passing ones. The intent is the same severity-first thinking behind CVSS, although CVSS combines its base metrics through a formula rather than taking a minimum.
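A minimal sketch of both steps (a weighted score per category, then the minimum across categories) is shown below. The weights, type names, and function names are all assumptions for illustration. Only "severity-weighted per category" and "minimum, not average" come from the text.

```typescript
type Severity = "critical" | "high" | "medium" | "low";

// Hypothetical weights; the real values are not given in the post.
const SEVERITY_WEIGHT: Record<Severity, number> = { critical: 10, high: 5, medium: 2, low: 1 };

interface ProbeOutcome {
  category: string;
  severity: Severity;
  passed: boolean;
}

// Score per category: 100 * (weight of passed probes) / (weight of all probes).
function categoryScores(outcomes: readonly ProbeOutcome[]): Record<string, number> {
  const earned: Record<string, number> = {};
  const possible: Record<string, number> = {};
  for (const o of outcomes) {
    const w = SEVERITY_WEIGHT[o.severity];
    possible[o.category] = (possible[o.category] ?? 0) + w;
    earned[o.category] = (earned[o.category] ?? 0) + (o.passed ? w : 0);
  }
  const scores: Record<string, number> = {};
  for (const category of Object.keys(possible)) {
    scores[category] = (100 * (earned[category] ?? 0)) / (possible[category] ?? 1);
  }
  return scores;
}

// Overall score: the worst category, so one bad category cannot be averaged away.
// Returning 0 for a scan with no categories is an arbitrary choice in this sketch.
function overallScore(scores: Record<string, number>): number {
  const values = Object.keys(scores).map((k) => scores[k] ?? 0);
  return values.length === 0 ? 0 : Math.min(...values);
}
```

With these weights, a PII category holding one failed critical probe and nine passed low-severity probes scores about 47.4. If a second category scores 100, the average would be about 73.7, but the overall score stays at 47.4.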

Cost model: 1 probe = 1 credit

We bill per probe executed because that is the only unit the user can predict and control. Suites publish their probe counts; the dashboard quotes total cost before launch; failed provider calls do not bill. The signup welcome credit covers exactly one probe so you can see the full pipeline end-to-end before paying.
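The billing rule is simple enough to sketch directly. The names below (`quoteCredits`, `billedCredits`, `providerCallSucceeded`) are hypothetical. The one-credit-per-probe rate and the rule that failed provider calls do not bill come from the text.

```typescript
const CREDITS_PER_PROBE = 1;

interface SuiteSummary {
  name: string;
  probeCount: number; // published by the suite
}

interface ProbeAttempt {
  providerCallSucceeded: boolean;
}

// Pre-launch quote: every selected probe is assumed to run once.
function quoteCredits(suites: readonly SuiteSummary[]): number {
  return suites.reduce((sum, s) => sum + s.probeCount * CREDITS_PER_PROBE, 0);
}

// Actual charge: only probes whose provider call succeeded are billed.
function billedCredits(attempts: readonly ProbeAttempt[]): number {
  return attempts.filter((a) => a.providerCallSucceeded).length * CREDITS_PER_PROBE;
}
```

Under this model the quote is an upper bound: the final charge can only be lower than the quote, never higher.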

What we'd do differently

  • Move judging to a worker queue earlier — we ran into 60s edge-function ceilings on 1k+ probe scans
  • Cache provider responses keyed by (model, prompt-hash) for suites that pin a deterministic temperature (for example, 0)
  • Ship a CLI from day one — every serious customer wants this in CI, not just the dashboard
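The caching idea in the second item could look roughly like this. It is a sketch under loud assumptions: `ResponseCache` and `cacheKey` are hypothetical names, the store is an in-memory `Map`, and the hash is 32-bit FNV-1a, chosen only to keep the example dependency-free. A real cache would more likely use SHA-256 and persistent storage, since 32-bit hashes collide.

```typescript
// 32-bit FNV-1a over UTF-16 code units; for illustration only, not collision-safe.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

function cacheKey(model: string, prompt: string): string {
  return model + ":" + fnv1a(prompt);
}

class ResponseCache {
  private store = new Map<string, string>();

  // Returns the cached response, or calls the provider once and remembers the result.
  async getOrCompute(
    model: string,
    prompt: string,
    compute: () => Promise<string>,
  ): Promise<string> {
    const key = cacheKey(model, prompt);
    const hit = this.store.get(key);
    if (hit !== undefined) return hit;
    const response = await compute();
    this.store.set(key, response);
    return response;
  }
}
```

Reusing a response is only sound when the provider is expected to return the same output for the same input, which is why the idea is limited to suites with a deterministic temperature.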
