What is an LLM vulnerability scanner?

An LLM vulnerability scanner sends batteries of adversarial probes — jailbreaks, prompt injections, PII extraction attempts, harmful-content requests — at a target LLM and grades the responses. The output is a vulnerability + optimization report with per-probe evidence, severity, and a prioritized fix list.

Which LLMs can I scan with FilterPrompt?

OpenAI, Anthropic, Google Gemini, Azure OpenAI, plus any OpenAI-compatible endpoint — Ollama, Groq, Mistral, Together AI, OpenRouter, Perplexity, Hugging Face, vLLM, or your own custom endpoint. Bring your own keys per tenant.

What kinds of vulnerabilities does FilterPrompt test for?

Jailbreaks (DAN, role hijack, translation smuggling), direct and indirect prompt injection, system-prompt extraction, harmful-content compliance, PII / secret leakage, bias & fairness, RAG poisoning, agent/tool abuse, output quality, and robustness — categories map to the OWASP LLM Top 10.

How are probes graded?

Each probe declares an evaluator: regex match, refusal-check, contains-check, or an AI judge (Gemini 3 Flash). Pass/fail comes with severity, category, the exact prompt sent, the model's full response, and the evaluator's reason — fully auditable.

How much does a scan cost?

1 credit per probe executed. New accounts get 1 welcome credit on signup. Pay-as-you-go credit packs after that — credits never expire. Connecting LLMs and creating tenants is free.

Building an AI Vulnerability Scanner: Architecture, Probe Library, and Scoring Math

Architecture · 2025-05-08 · 13 min read · FilterPrompt Team

How we built FilterPrompt's AI vulnerability scanner — the async execution model, BYO-key tenant isolation, the layered judge, and the math behind severity-weighted scores.

This post walks through the actual architecture of FilterPrompt's AI vulnerability scanner — the trade-offs, the data model, and the scoring math. If you are evaluating a scanner, these are the questions to ask the vendor. If you are building one, these are the decisions you cannot avoid.

High-level flow

User picks suites + connected LLM provider in the dashboard
Frontend creates a `scans` row and invokes the `scan-run` edge function
Edge function returns immediately; the run executes asynchronously
For each probe: build the request → call the tenant's LLM → judge the response → write a `scan_results` row
Realtime updates stream back to the dashboard; final report is rendered when status flips to `completed`

Data model

Four tables hold the entire scanner domain:

`scan_suites` — versioned categories (Prompt Injection, PII, Jailbreaks, Bias, etc.)
`scan_probes` — individual adversarial prompts, each with its own judge recipe
`scans` — one row per scan run, with status, tenant, model, started_at, completed_at
`scan_results` — one row per probe execution: request, response, verdict, judge reasoning, latency, cost

RLS policies scope every read/write to the tenant. The `has_role` security-definer function gates admin-only suite editing. Tenant LLM keys live in `tenant_providers`, encrypted with `ENCRYPTION_SECRET` and only decrypted inside the edge function at request time.

Async execution: why edge functions

A full scan can be 500–5,000 probes. Doing that in a serverless function with a 60-second budget is not viable. We invoke the function, kick off the work, and return a 202 immediately. The function then loops through probes with bounded concurrency (default 5), respecting the provider's rate limits, and writes each result as it lands. The dashboard subscribes via Supabase Realtime and renders progress live.

The judging pipeline

The judge is where vulnerability scanners live or die. Ours is a tiered cascade:

Tier 1 — regex / keyword: 'I cannot', 'I'm sorry', system-prompt fingerprints. Fast, deterministic, free.
Tier 2 — refusal classifier: distinguishes hard refusals from soft refusals that comply anyway.
Tier 3 — contains check: did the response leak the canary the probe planted?
Final tier — AI-based detection evaluates the response against a probe-specific rubric.

Most probes resolve in Tier 1 or 2. Tier 4 is reserved for nuanced cases (bias, hallucination, policy edge cases) because it is the most expensive and the slowest. Every verdict carries the tier that decided it, so reviewers know how much to trust each result.

Scoring math

A raw pass/fail count is misleading — failing one critical PII probe is not equivalent to failing one low-severity tone probe. We compute a severity-weighted score per category:

The overall scan score is the minimum across categories — not the average. One catastrophically failing category should not be hidden by nine passing ones. This is the same philosophy CVSS uses for base-vector scoring.

Cost model: 1 probe = 1 credit

We bill per probe executed because that is the only unit the user can predict and control. Suites publish their probe counts; the dashboard quotes total cost before launch; failed provider calls do not bill. The signup welcome credit covers exactly one probe so you can see the full pipeline end-to-end before paying.

What we'd do differently

Move judging to a worker queue earlier — we ran into 60s edge-function ceilings on 1k+ probe scans
Cache provider responses keyed by (model, prompt-hash) for deterministic-temperature suites
Ship a CLI from day one — every serious customer wants this in CI, not just the dashboard