Prompt injection protection
Detect and block prompt-injection attempts before they reach the model
Prompt injection is the most common attack against LLM-backed products. Gateway scores every inbound prompt and outbound completion against a tuned classifier, surfaces detections in the dashboard, and, when configured to enforce, blocks or redacts requests above your threshold.
PI protection has three modes, set per organization:
Roll out PI protection in alert mode first. Watch the Alerts tab for a week, tune your thresholds and allowlist, then flip to block.
Gateway scores prompts using a fine-tuned DeBERTa v3 classifier hosted as a sidecar to the data plane. Each request gets a score from 0.0 (clean) to 1.0 (almost certainly injection). The score is compared against two thresholds:
Scores between the two thresholds are recorded but do not trigger enforcement. The defaults were calibrated against the deepset and jackhhao datasets, plus an internal customer dataset, to hit a 1% false-positive rate.
pi_pass_threshold must always be less than or equal to pi_block_threshold.
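The threshold rules above can be sketched as follows. This is a minimal illustration, not Gateway's implementation: the `PiThresholds` class, the `classify` function, the band names, and the inclusive boundary handling are all assumptions, and the example values are not the product defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiThresholds:
    """Hypothetical container for the two documented thresholds."""

    pi_pass_threshold: float
    pi_block_threshold: float

    def __post_init__(self) -> None:
        # Scores range from 0.0 to 1.0, so thresholds outside that range make no sense.
        for name in ("pi_pass_threshold", "pi_block_threshold"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
        # Documented constraint: the pass threshold may not exceed the block threshold.
        if self.pi_pass_threshold > self.pi_block_threshold:
            raise ValueError("pi_pass_threshold must be <= pi_block_threshold")


def classify(score: float, thresholds: PiThresholds) -> str:
    """Map a score to one of three bands (boundary handling is an assumption)."""
    if score >= thresholds.pi_block_threshold:
        return "enforce"      # crosses the block threshold
    if score <= thresholds.pi_pass_threshold:
        return "pass"         # treated as clean
    return "record_only"      # between the thresholds: recorded, not enforced


# Illustrative values only, not the calibrated defaults.
thresholds = PiThresholds(pi_pass_threshold=0.3, pi_block_threshold=0.8)
print(classify(0.5, thresholds))
```

Validating the ordering constraint at construction time mirrors the rule that `pi_pass_threshold` must never exceed `pi_block_threshold`.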
Two action settings control what happens when a detection crosses the block threshold:
pi_input_action: applied when the inbound prompt is flagged. Default: block.
pi_output_action: applied when the model’s response is flagged. Default: redact.
Each accepts one of five values:
Some legitimate workflows look like injection, for example security researchers querying a model about jailbreak techniques, or a customer-support tool that summarizes spammy emails. To suppress those false positives, configure an allowlist of regex patterns. Any prompt that matches at least one pattern bypasses the classifier.
Limits:
re.compile). Invalid patterns are rejected on save.
Allowlist matches are also logged so you can audit what’s being suppressed.
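To test allowlist patterns locally before saving them, a sketch along these lines may help. It assumes patterns are Python regular expressions and that a match anywhere in the prompt counts (`re.search`). Gateway's actual matching semantics may differ, and both function names are hypothetical.

```python
from __future__ import annotations

import re


def compile_allowlist(patterns: list[str]) -> list[re.Pattern[str]]:
    """Compile patterns, rejecting invalid ones much as a save-time check might."""
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error as exc:
            raise ValueError(f"invalid allowlist pattern {pattern!r}: {exc}") from exc
    return compiled


def is_allowlisted(prompt: str, allowlist: list[re.Pattern[str]]) -> bool:
    """Return True if at least one pattern matches the prompt."""
    # Assumption: unanchored matching via re.search; the real service may anchor.
    return any(p.search(prompt) for p in allowlist)


# Hypothetical patterns for the two workflows mentioned above.
allowlist = compile_allowlist([
    r"(?i)jailbreak research",
    r"^Summarize this email:",
])
print(is_allowlisted("Summarize this email: WIN A PRIZE NOW", allowlist))
```

Because any matching prompt bypasses the classifier entirely, narrow and anchored patterns are the safer choice.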
Open Security → Prompt Injection in the Merge Gateway dashboard and configure:
off / alert / block)
route action, if used
Only org members with the Manage security rules permission can change PI settings. Every change is written to the audit log.
In alert and block modes, every detection generates a security alert visible under Security → Alerts in the dashboard. Each alert includes:
alert / block)
observe, redact, route, block, escalate)
Alerts are indexed in the same store as the request log, so you can filter by customer, project, or API key when investigating a spike.
The full triggering text is stored on the alert by default. If you handle sensitive prompts, set pi_log_full_text_on_block: false to redact the text from the alert record while keeping the metadata.
When PI protection blocks an inbound prompt, Gateway returns HTTP 400 with a stable signal you can match on:
Output blocks return the same shape with a pi_output_blocked code. The stable detection signal is the code field (pi_blocked or pi_output_blocked), which doesn’t change across releases.
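On the client side, matching on the `code` field might look like the sketch below. The exact response layout is not reproduced here, so the code assumes a JSON body with `code` either at the top level or nested under an `error` object, and assumes output blocks also arrive with HTTP 400. The helper name `pi_block_code` is hypothetical.

```python
from __future__ import annotations

import json

# The two stable codes named in this document.
PI_BLOCK_CODES = {"pi_blocked", "pi_output_blocked"}


def pi_block_code(status: int, body: str) -> str | None:
    """Return the PI block code if the response is a PI block, else None."""
    if status != 400:
        return None
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    # Assumption: `code` sits at the top level or under an `error` object.
    candidates = [payload.get("code")]
    error = payload.get("error")
    if isinstance(error, dict):
        candidates.append(error.get("code"))
    for code in candidates:
        if isinstance(code, str) and code in PI_BLOCK_CODES:
            return code
    return None


# Hypothetical response body, for illustration only.
print(pi_block_code(400, '{"code": "pi_blocked"}'))
```

Matching on the code rather than on a human-readable message keeps the check stable across releases.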
The classifier sidecar adds a few milliseconds per request, negligible compared to LLM inference. The classifier runs in parallel with policy resolution where possible.
By default, Gateway fails open: if the sidecar errors or times out, the request proceeds as if no detection occurred. Set pi_fail_closed: true to reject the request instead. This is appropriate for high-stakes workloads where running any inference without scanning is unacceptable.
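The difference between failing open and failing closed can be illustrated with a short sketch. This is not Gateway's code: `score_or_fallback`, `RequestRejected`, and the `classify` callable are hypothetical stand-ins for the sidecar call, and only the `pi_fail_closed` name comes from the setting described above.

```python
from __future__ import annotations

from typing import Callable


class RequestRejected(Exception):
    """Raised when scanning is unavailable and fail-closed is enabled."""


def score_or_fallback(
    prompt: str,
    classify: Callable[[str], float],
    pi_fail_closed: bool = False,
) -> float | None:
    """Return the classifier score, or apply the configured failure policy."""
    try:
        return classify(prompt)
    except Exception:
        if pi_fail_closed:
            # Fail closed: refuse to run inference without a scan.
            raise RequestRejected("PI scanning unavailable") from None
        # Fail open: proceed as if no detection occurred.
        return None


def broken_sidecar(_prompt: str) -> float:
    """Simulates a sidecar that times out."""
    raise TimeoutError("sidecar timed out")


print(score_or_fallback("hello", broken_sidecar))  # fail open
```

With `pi_fail_closed=True`, the same call raises `RequestRejected` instead of returning `None`.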
PI protection is an org-level setting in the current release. Use Projects and customer-scoped blocklist rules to apply different routing decisions to different traffic, but the classifier thresholds themselves are org-wide.
DLP scans for structured sensitive data like SSNs, credit cards, and API keys. PI protection scans for adversarial intent: prompts trying to override system instructions, leak training data, or exfiltrate prior conversation. Run both for full coverage.
Every request gets a pi_score field on its log entry, regardless of mode. That means you can backfill an analysis even if you ran in off mode. Re-enable alert and the score starts populating.
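One use of logged scores is estimating how much traffic a candidate block threshold would have flagged. The sketch below assumes log entries have been exported as dictionaries carrying a `pi_score` key. The export format and the `flagged_fraction` helper are hypothetical, and the inclusive comparison is an assumption.

```python
def flagged_fraction(log_entries: list, block_threshold: float) -> float:
    """Fraction of scored entries at or above a candidate block threshold."""
    # Skip entries that carry no score.
    scores = [e["pi_score"] for e in log_entries if e.get("pi_score") is not None]
    if not scores:
        return 0.0
    return sum(score >= block_threshold for score in scores) / len(scores)


# Made-up entries for illustration; real entries would come from your log export.
entries = [
    {"pi_score": 0.10},
    {"pi_score": 0.50},
    {"pi_score": 0.90},
    {"pi_score": 0.95},
    {},  # entry without a score
]
print(flagged_fraction(entries, 0.9))
```

Running this over a week of traffic at several candidate thresholds gives a rough view of the enforcement rate before switching to block mode.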