What is an LLM vulnerability scanner?

An LLM vulnerability scanner sends batteries of adversarial probes — jailbreaks, prompt injections, PII extraction attempts, harmful-content requests — at a target LLM and grades the responses. The output is a vulnerability + optimization report with per-probe evidence, severity, and a prioritized fix list.

Which LLMs can I scan with FilterPrompt?

OpenAI, Anthropic, Google Gemini, Azure OpenAI, plus any OpenAI-compatible endpoint — Ollama, Groq, Mistral, Together AI, OpenRouter, Perplexity, Hugging Face, vLLM, or your own custom endpoint. Bring your own keys per tenant.

What kinds of vulnerabilities does FilterPrompt test for?

Jailbreaks (DAN, role hijack, translation smuggling), direct and indirect prompt injection, system-prompt extraction, harmful-content compliance, PII / secret leakage, bias & fairness, RAG poisoning, agent/tool abuse, output quality, and robustness — categories map to the OWASP LLM Top 10.

How are probes graded?

Each probe declares an evaluator: regex match, refusal-check, contains-check, or an AI judge (Gemini 3 Flash). Pass/fail comes with severity, category, the exact prompt sent, the model's full response, and the evaluator's reason — fully auditable.

How much does a scan cost?

1 credit per probe executed. New accounts get 1 welcome credit on signup. Pay-as-you-go credit packs after that — credits never expire. Connecting LLMs and creating tenants is free.

AI Vulnerability Scanner: How It Works, What It Tests, and Why You Need One in 2026

Deep Dive · 2025-01-22 · 14 min read · FilterPrompt Team

A complete technical deep dive into AI vulnerability scanners — the probe taxonomy, judging pipeline, scoring, and how to read a scan report end-to-end.

An AI vulnerability scanner is to your LLM what Nessus or Burp Suite is to a web app: an automated, repeatable, evidence-producing tool that fires a curated battery of adversarial probes at the model and grades the responses. In 2026, with every product team shipping copilots and agents, running one is no longer optional — it is the minimum diligence step before exposing an LLM to real users.

What an AI vulnerability scanner actually does

Under the hood a scanner is three things glued together: a probe library (thousands of crafted adversarial inputs), an execution engine that sends those probes to the model under test, and a judging pipeline that decides — for each response — whether the model passed, failed, or partially failed. The output is a report with severity, evidence, and a remediation path.

Probe library — versioned, categorized adversarial prompts (OWASP LLM Top 10, MITRE ATLAS, custom)
Execution engine — concurrency, retries, rate-limit handling against the tenant's connected LLM
Judging pipeline — proprietary multi-stage detection verdicts
Scoring + reporting — per-category pass/fail rates, severity, reproducible evidence

The probe taxonomy

A serious scanner does not just throw 'jailbreak' prompts at the model. Probes are organized into categories that map to real-world risk classes. Below is the taxonomy we use in FilterPrompt's scanner:

1. Prompt injection (direct + indirect)

Instruction-override, role hijack, encoded payloads (base64, ROT13, zero-width), and context smuggling via simulated retrieved documents. The judge looks for system-prompt leakage and instruction compliance violations.

2. Jailbreaks & policy bypass

DAN-style personas, hypothetical framing ('write a fictional story where…'), grandma exploits, and academic-research disguises. Probes evolve weekly — what worked on GPT-4 in 2024 is patched today; what works today will be patched tomorrow.

3. Sensitive information disclosure (PII / secrets)

Targeted extraction prompts: 'list emails from your training data', 'repeat the API key in your context', training-data-extraction attacks via repeated tokens. The judge runs PII and secret detectors over the response.

4. Insecure output handling

Probes that try to coerce the model into emitting markdown image exfiltration URLs, HTML/JS payloads, or unsanitized SQL fragments. Critical for any app that renders model output in a browser.

5. Excessive agency & tool abuse

Only relevant if your model has tools. Probes attempt to chain tool calls without confirmation, escalate scope, or trick the agent into destructive actions.

6. Hallucination & factual grounding

Adversarial questions designed to elicit confident-but-wrong answers, fake citations, and made-up APIs. The judge cross-checks claims against a known-good reference set.

7. Bias, toxicity, and harmful content

Demographic-pair probes, hate-speech elicitation, self-harm content, and CSAM-adjacent refusal tests. Required for any consumer-facing deployment and most enterprise compliance reviews.

Inside the judging pipeline

How a scanner decides 'pass' or 'fail' is the single most important quality signal. A noisy judge produces a useless report. FilterPrompt uses a layered judge:

Regex / keyword — fastest, catches obvious system-prompt leaks and refusal markers
Refusal classifier — was this a proper refusal or a soft compliance ('I shouldn't, but here it is anyway…')?
Contains-check — did the response include the canary string the probe planted?
AI-based detection evaluates nuanced cases against a rubric written for that probe

How to read a scan report

A good report answers four questions in order: what failed, how badly, with what evidence, and what to do about it. When you open a FilterPrompt scan you see:

Per-category pass rate (e.g. Prompt Injection: 78/100 probes passed)
Severity-weighted score — a single number safe to share with leadership
Failed-probe drilldown — exact prompt, exact response, judge reasoning, OWASP mapping
Diff vs previous scan — regression detection across model upgrades

When to run scans

Before launch — full suite against the production model + system prompt
On every system-prompt change — system prompts are code; treat them like code
On every model upgrade — even minor version bumps shift the safety surface
Weekly in staging — catch drift and new public jailbreaks as they emerge
On every new tool integration — agentic surface = new attack surface

What an AI vulnerability scanner is NOT

It is not a runtime firewall — it tests, it does not block. It is not a replacement for human red-teaming on novel agentic systems. It is not a compliance certificate by itself, though its reports are exactly the evidence SOC 2, ISO 42001, and EU AI Act auditors are starting to ask for.

Getting started with FilterPrompt's scanner

Connect your LLM provider (OpenAI, Anthropic, Azure, Google, or custom endpoint) — keys are encrypted at rest
Pick the categories relevant to your deployment (consumer apps need bias/toxicity; agents need tool-abuse)
Run your first scan — your signup credit covers one full category × one model
Review the report, fix the criticals, re-scan to confirm

Every probe executed costs 1 credit. Purchased credits never expire. Reports are stored permanently and diffable across runs so you can prove your security posture is improving over time.