What is an LLM vulnerability scanner?

An LLM vulnerability scanner sends batteries of adversarial probes — jailbreaks, prompt injections, PII extraction attempts, harmful-content requests — at a target LLM and grades the responses. The output is a vulnerability + optimization report with per-probe evidence, severity, and a prioritized fix list.

Which LLMs can I scan with FilterPrompt?

OpenAI, Anthropic, Google Gemini, Azure OpenAI, plus any OpenAI-compatible endpoint — Ollama, Groq, Mistral, Together AI, OpenRouter, Perplexity, Hugging Face, vLLM, or your own custom endpoint. Bring your own keys per tenant.

What kinds of vulnerabilities does FilterPrompt test for?

Jailbreaks (DAN, role hijack, translation smuggling), direct and indirect prompt injection, system-prompt extraction, harmful-content compliance, PII / secret leakage, bias & fairness, RAG poisoning, agent/tool abuse, output quality, and robustness — categories map to the OWASP LLM Top 10.

How are probes graded?

Each probe declares an evaluator: regex match, refusal-check, contains-check, or an AI judge (Gemini 3 Flash). Pass/fail comes with severity, category, the exact prompt sent, the model's full response, and the evaluator's reason — fully auditable.
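The deterministic evaluator types can be pictured with a short sketch. This is an illustration only, not FilterPrompt's implementation: the function names, the `ProbeResult` fields, and the refusal phrases are assumptions, and the AI judge is left out because it needs a model call.

```python
import re
from dataclasses import dataclass

# Hypothetical refusal markers; a real refusal-check would need a broader list.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable to")


@dataclass
class ProbeResult:
    passed: bool
    reason: str  # the evaluator's explanation, kept for the audit trail


def regex_evaluator(response: str, forbidden_pattern: str) -> ProbeResult:
    """Fail the probe if the response matches a forbidden pattern."""
    match = re.search(forbidden_pattern, response, re.IGNORECASE)
    if match:
        return ProbeResult(False, f"matched forbidden pattern: {match.group(0)!r}")
    return ProbeResult(True, "no forbidden pattern found")


def contains_evaluator(response: str, forbidden_text: str) -> ProbeResult:
    """Fail the probe if the response contains a literal string (case-insensitive)."""
    if forbidden_text.lower() in response.lower():
        return ProbeResult(False, f"response contains {forbidden_text!r}")
    return ProbeResult(True, f"response does not contain {forbidden_text!r}")


def refusal_evaluator(response: str) -> ProbeResult:
    """Pass the probe only if the model appears to have refused."""
    lowered = response.lower()
    for marker in REFUSAL_MARKERS:
        if marker in lowered:
            return ProbeResult(True, f"model refused (matched {marker!r})")
    return ProbeResult(False, "no refusal detected")
```

Returning a reason string alongside the pass/fail flag is what makes each verdict reviewable later.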

How much does a scan cost?

1 credit per probe executed. New accounts get 1 welcome credit on signup. Pay-as-you-go credit packs after that — credits never expire. Connecting LLMs and creating tenants is free.

Real-Time LLM Protection — Monitor and Block Attacks in Production

Guide · 2023-04-18 · 14 min read · FilterPrompt Security Team

How real-time LLM protection works, the architecture that holds up under load, latency budgets, what to monitor, and the production checklist for catching attacks the moment they happen.

Real-time LLM protection is the discipline of catching attacks against your LLM application the moment they happen — at the prompt, before the model responds, and at the response, before it reaches the user. Batch detection (logs reviewed nightly) is fine for forensics but useless against an active prompt injection that exfiltrates a user's data in seconds. This guide covers the architecture, the latency budget, the monitoring signals that matter, and the production playbook for shipping real-time LLM protection that actually holds up.

Why real-time matters

Two reasons. First, attack windows. A prompt injection that succeeds gets one shot to do something destructive: exfiltrate data, call an unauthorised tool, or leak the system prompt. If you only detect it in nightly log review, the damage is already done. Second, user experience. A real-time block returns a clean error to the user; with after-the-fact detection, the model's harmful output has already reached the user's screen, sits in your logs, and may end up in training data.

Real-time LLM protection also enables policy enforcement at the prompt level — refuse a request because it violates topical policy (no medical advice, no legal advice, no competitor mentions) without paying the inference cost. For high-volume applications this saves real money on top of the security benefit.

Architecture that holds up

A production-grade real-time LLM protection layer has four detection stages, ordered cheap-to-expensive so cheap detections short-circuit before expensive ones run. Pattern rules come first (microseconds, deterministic, catch known templates). Structural validation is second (low milliseconds, JSON schema and allowlist checks). Semantic classifiers are third (10–60ms, transformer models for novel injection variants). Output-side checks are fourth (they run only on responses and look for exfiltration markdown, PII, and secrets). Total median budget: under 100ms.
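A minimal sketch of the cheap-to-expensive ordering on the input side might look like the following. Everything here is an assumption made for illustration: the stage names, the single regex, the key allowlist, and especially `semantic_stage`, which is a keyword stub standing in for a real transformer classifier. The output-side stage is omitted because it runs on responses, not prompts.

```python
import json
import re
from typing import Callable, List, Optional

# Each stage returns a block reason, or None to let the prompt continue.
Stage = Callable[[str], Optional[str]]

# Illustrative pattern only; real rule sets are much larger.
INJECTION_PATTERNS = [re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)]
ALLOWED_KEYS = {"question", "context"}  # hypothetical allowlist for JSON payloads


def pattern_stage(prompt: str) -> Optional[str]:
    """Cheapest check: deterministic match against known templates."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(prompt):
            return "pattern: known injection template"
    return None


def structural_stage(prompt: str) -> Optional[str]:
    """If the prompt is a JSON object, reject keys outside the allowlist."""
    try:
        payload = json.loads(prompt)
    except json.JSONDecodeError:
        return None  # plain text, nothing structural to validate
    if isinstance(payload, dict) and not set(payload) <= ALLOWED_KEYS:
        return "structure: unexpected keys"
    return None


def semantic_stage(prompt: str) -> Optional[str]:
    """Stub for the expensive classifier; a real one would call a model."""
    if "disregard your rules" in prompt.lower():
        return "semantic: likely injection"
    return None


INPUT_STAGES: List[Stage] = [pattern_stage, structural_stage, semantic_stage]


def inspect_prompt(prompt: str) -> Optional[str]:
    """Run stages in cost order and stop at the first block reason."""
    for stage in INPUT_STAGES:
        reason = stage(prompt)
        if reason is not None:
            return reason  # short-circuit: later, costlier stages never run
    return None
```

Because the loop returns on the first hit, a prompt caught by a pattern rule never pays for the classifier.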

The fail-open vs fail-closed decision

Every real-time protection layer must answer one question before it ships: when the protection layer itself fails (timeout, dependency outage, panic), does traffic pass through unchecked (fail-open) or get blocked (fail-closed)? Fail-open optimises availability and is correct for low-risk consumer chatbots. Fail-closed optimises security and is correct for healthcare, finance, and any application where a single unfiltered response is worse than an outage. The right answer is per-tenant configurable. FilterPrompt defaults to fail-open with a loud alert; tenants in regulated industries flip to fail-closed.
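One way to make the choice explicit in code is a wrapper that treats any failure of the inspection call the same way and then consults a per-tenant flag. This is a sketch under assumptions: `guarded_verdict`, its parameters, and the 100ms default timeout are hypothetical names and values, and the `logging.error` call stands in for whatever alerting you actually use.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)


def guarded_verdict(inspect, prompt, fail_closed, timeout_s=0.1):
    """Return (allowed, reason) even when the protection layer itself fails.

    `inspect` is any callable that returns a block reason or None.
    """
    future = _executor.submit(inspect, prompt)
    try:
        reason = future.result(timeout=timeout_s)
    except Exception as exc:  # timeout, dependency outage, or a crash in the layer
        logging.error("protection layer failed: %r", exc)  # the loud alert
        if fail_closed:
            return False, "protection layer unavailable (fail-closed)"
        return True, "protection layer unavailable (fail-open)"
    if reason is not None:
        return False, reason
    return True, "clean"
```

A call such as `guarded_verdict(my_inspector, prompt, fail_closed=tenant_is_regulated)` keeps the decision in one place, where `my_inspector` and `tenant_is_regulated` are placeholders for your own inspector and tenant setting.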

What to monitor

Block rate by rule — sudden spikes mean either an attack or a false-positive regression
Median + p99 inspection latency — if p99 climbs past 300ms, streaming UX degrades
Detection-layer health — each layer's error rate, separately. A semantic classifier outage should not silently fail-open the whole pipeline
Verdict log volume — a sudden drop usually means an integration outage upstream, not that attacks stopped
OWASP LLM Top 10 coverage by control — auditors will ask
Per-tenant block rate — wildly different rates between similar tenants usually indicate a misconfigured policy
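Two of these signals, inspection latency and block rate by rule, are simple to compute from raw verdict records. The sketch below assumes each verdict is a dict with a hypothetical `rule` key (the name of the rule that blocked, or `None` if the request passed) and uses the nearest-rank method for p99; your metrics backend may define percentiles differently.

```python
from collections import Counter
from statistics import median


def latency_summary(latencies_ms):
    """Median and nearest-rank p99 for a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.99 * n), which avoids floating-point rounding.
    rank = max(1, (99 * len(ordered) + 99) // 100)
    return {"median_ms": median(ordered), "p99_ms": ordered[rank - 1]}


def block_rate_by_rule(verdicts):
    """Fraction of all requests blocked by each rule."""
    total = len(verdicts)
    if total == 0:
        return {}
    blocked = Counter(v["rule"] for v in verdicts if v["rule"] is not None)
    return {rule: count / total for rule, count in blocked.items()}
```

Comparing the output of `block_rate_by_rule` across time windows is one way to spot the sudden spikes described above.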

Production playbook

Deploy the protection layer in shadow mode first — log verdicts but do not block. Run for 1–2 weeks to baseline false positives.
Promote rules from shadow to enforce one category at a time. Start with prompt injection patterns (lowest false-positive risk), then PII redaction, then topic enforcement.
Wire verdict logs into your SIEM (Splunk, Datadog, Sentinel) with alerts on anomalous block-rate deltas.
Add a fail-open or fail-closed default per tenant based on their risk tier. Document the choice.
Run an adversarial scanner weekly to verify the protection layer still blocks what it claims (regressions happen when you change models or prompts).
Build a tenant-facing audit export — verdict log, OWASP LLM Top 10 mapping, period summary — so enterprise customers can satisfy their own auditors.
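The shadow-to-enforce promotion in the first two steps can be sketched as a per-category mode switch that always logs and only sometimes blocks. The category names, the `MODES` table, and the log record fields below are assumptions chosen for illustration, not a FilterPrompt schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical rollout state: one category promoted, the rest still in shadow.
MODES = {"prompt_injection": "enforce", "pii": "shadow", "topic": "shadow"}


def apply_verdict(category, reason, tenant_id, log=print):
    """Log every verdict as structured JSON; block only in enforce mode.

    Returns True if the request should be blocked.
    """
    mode = MODES.get(category, "shadow")  # unknown categories never block
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "tenant": tenant_id,
        "category": category,
        "reason": reason,
        "mode": mode,
        "blocked": mode == "enforce",
    }
    log(json.dumps(record))  # one JSON object per line is easy to ship to a SIEM
    return record["blocked"]
```

Promoting a category then becomes a one-line change to the mode table, and the shadow-period logs already contain the false-positive baseline.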

Common mistakes

Three patterns we see fail in production. First, single-layer protection: pattern rules only, with no semantic layer. This is trivial to bypass with paraphrased instructions. Second, no output-side check. The team protects the input, congratulates itself, and ships an application that happily echoes a markdown image pointing at attacker.com. Third, missing or unstructured verdict logs. When the auditor asks 'show me what you blocked last quarter', the team has nothing.
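For the second mistake, a basic output-side check can be as small as scanning the response for markdown images whose host is not on an allowlist. This is a rough sketch: the allowlist contents and function name are hypothetical, and the regex covers only the common inline image syntax, not reference-style images or raw HTML.

```python
import re
from urllib.parse import urlparse

ALLOWED_IMAGE_HOSTS = {"cdn.example.com"}  # hypothetical allowlist

# Captures the URL in inline markdown images: ![alt](url ...)
MD_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*([^)\s]+)")


def untrusted_image_urls(response: str):
    """Return markdown image URLs whose host is not explicitly allowed."""
    flagged = []
    for url in MD_IMAGE.findall(response):
        host = urlparse(url).hostname  # None for relative URLs
        if host not in ALLOWED_IMAGE_HOSTS:
            flagged.append(url)
    return flagged
```

A non-empty result is a reason to strip the image or block the response before it renders, since loading the image is what sends the query string to the attacker's server.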