Prompt injection protection
Detect and block prompt-injection attempts before they reach the model.
Prompt injection is the most common attack against LLM-backed products. Gateway scores every inbound prompt and outbound completion against a tuned classifier, surfaces detections in the dashboard, and - when configured to enforce - blocks or redacts requests above your threshold.
Modes
PI protection has three modes, set per organization:
- `off` - PI scanning is disabled.
- `alert` - requests are scored and detections generate security alerts, but nothing is blocked or modified.
- `block` - detections at or above the block threshold are enforced using the configured input and output actions.
Roll out PI protection in alert mode first. Watch the Alerts tab for a week, tune your thresholds and allowlist, then flip to block.
How detection works
Gateway scores prompts using a fine-tuned DeBERTa v3 classifier hosted as a sidecar to the data plane. Each request gets a score from 0.0 (clean) to 1.0 (almost certainly injection). The score is compared against two thresholds:
- `pi_block_threshold` - scores at or above this value trigger an alert and, in block mode, enforcement.
- `pi_pass_threshold` - scores below this value are treated as clean.
Scores between the two thresholds are recorded but do not trigger enforcement. The defaults were calibrated against the deepset and jackhhao public datasets plus an internal customer dataset to hit a 1% false-positive rate.
pi_pass_threshold must always be less than or equal to pi_block_threshold.
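The threshold comparison described above can be sketched as follows; the function and verdict names are illustrative, only the threshold names come from this page:

```python
def verdict(score: float, pi_pass_threshold: float, pi_block_threshold: float) -> str:
    """Classify a pi score against the two configured thresholds.

    Verdict labels are illustrative; the threshold semantics follow the docs:
    below pass -> clean, at/above block -> enforced, in between -> recorded only.
    """
    if pi_pass_threshold > pi_block_threshold:
        raise ValueError("pi_pass_threshold must be <= pi_block_threshold")
    if score >= pi_block_threshold:
        return "enforced"
    if score < pi_pass_threshold:
        return "clean"
    return "recorded"
```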
Actions on detection
Two action settings control what happens when a detection crosses the block threshold:
- `pi_input_action` - applied when the inbound prompt is flagged. Default: `block`.
- `pi_output_action` - applied when the model’s response is flagged. Default: `redact`.
Each accepts one of five values: `observe`, `redact`, `route`, `block`, or `escalate`.
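A minimal sketch of how the two settings might select an action, assuming a dict-shaped settings object; the function name is hypothetical, while the defaults mirror the documented ones:

```python
ACTIONS = {"observe", "redact", "route", "block", "escalate"}

def resolve_action(direction: str, settings: dict) -> str:
    """Pick the configured action for a flagged prompt or completion.

    Documented defaults: block for input, redact for output.
    """
    key = "pi_input_action" if direction == "input" else "pi_output_action"
    default = "block" if direction == "input" else "redact"
    action = settings.get(key, default)
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return action
```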
Allowlist patterns
Some legitimate workflows look like injection - for example, security researchers querying a model about jailbreak techniques, or a customer-support tool that summarizes spammy emails. To suppress those false positives, configure an allowlist of regex patterns. Any prompt that matches at least one pattern bypasses the classifier.
Limits:
- Up to 50 patterns per org.
- Each pattern can be up to 200 characters.
- Patterns must compile as valid Python regex (`re.compile`). Invalid patterns are rejected on save.
Allowlist matches are also logged so you can audit what’s being suppressed.
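The limits and bypass semantics above can be sketched in Python, which matches the documented `re.compile` validation; the function names are illustrative:

```python
import re

MAX_PATTERNS = 50       # up to 50 patterns per org
MAX_PATTERN_LEN = 200   # each pattern up to 200 characters

def validate_allowlist(patterns: list[str]) -> list[re.Pattern]:
    """Enforce the documented limits and reject invalid regexes on save."""
    if len(patterns) > MAX_PATTERNS:
        raise ValueError("at most 50 patterns per org")
    compiled = []
    for p in patterns:
        if len(p) > MAX_PATTERN_LEN:
            raise ValueError(f"pattern too long: {p!r}")
        try:
            compiled.append(re.compile(p))
        except re.error as exc:
            raise ValueError(f"invalid pattern {p!r}: {exc}") from None
    return compiled

def bypasses_classifier(prompt: str, compiled: list[re.Pattern]) -> bool:
    """A prompt matching at least one pattern skips the classifier."""
    return any(p.search(prompt) for p in compiled)
```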
Configuring PI protection in the dashboard
Open Security → Prompt Injection in the Merge Gateway dashboard and configure:
- Mode (`off` / `alert` / `block`)
- Block and pass thresholds
- Input and output actions
- Allowlist patterns
- A “safer” vendor for the `route` action, if used
Only org admins with the MANAGE_ORG_SETTINGS permission can change PI settings. Every change is written to the audit log.
Security alerts
In alert and block modes, every detection generates a security alert visible under Security → Alerts in the dashboard. Each alert includes:
- Timestamp and request ID
- Mode at the time of the event (`alert` / `block`)
- Classifier score and the configured thresholds
- The action that was applied (`observe`, `redact`, `route`, `block`, or `escalate`)
- The triggering segment of the prompt or completion
Alerts are indexed in the same store as the request log, so you can filter by customer, project, or API key when investigating a spike.
The full triggering text is stored on the alert by default. If you handle sensitive prompts, set pi_log_full_text_on_block: false to redact the text from the alert record while keeping the metadata.
What a blocked request looks like
When PI protection blocks an inbound prompt, Gateway returns HTTP 400 with a stable signal you can match on:
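This page does not reproduce the exact payload; in the sketch below, only the `code` field value is confirmed here, and the other field names (`error`, `message`, `pi_score`) are illustrative assumptions:

```json
{
  "error": {
    "code": "pi_blocked",
    "message": "Request blocked by prompt-injection protection",
    "pi_score": 0.97
  }
}
```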
Output blocks return the same shape with a `pi_output_blocked` code. Match on the `code` field - `pi_blocked` for blocked inputs, `pi_output_blocked` for blocked outputs - which is stable across releases.
FAQ
Does PI protection add latency?
The classifier sidecar adds a few milliseconds per request - negligible compared to LLM inference. The classifier runs in parallel with policy resolution where possible.
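The run-in-parallel idea can be sketched with asyncio; both coroutines below are stand-ins with made-up delays and return values, not Gateway's actual internals:

```python
import asyncio

async def score_prompt(prompt: str) -> float:
    """Stand-in for the classifier sidecar round trip."""
    await asyncio.sleep(0.003)
    return 0.02

async def resolve_policy(prompt: str) -> dict:
    """Stand-in for Gateway's policy resolution."""
    await asyncio.sleep(0.002)
    return {"route": "primary"}

async def handle(prompt: str):
    # Run the classifier alongside policy resolution so the added latency
    # is roughly the max of the two steps, not their sum.
    return await asyncio.gather(score_prompt(prompt), resolve_policy(prompt))

score, policy = asyncio.run(handle("hello"))
```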
What happens when the classifier sidecar is unreachable?
By default, Gateway fails open: if the sidecar errors or times out, the request proceeds as if no detection occurred. Set pi_fail_closed: true to reject the request instead - appropriate for high-stakes workloads where any inference is unacceptable when scanning is degraded.
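The fail-open/fail-closed distinction can be sketched as follows; `SidecarError` and the function names are placeholders, while the two behaviors mirror the ones documented above:

```python
class SidecarError(Exception):
    """Stand-in for a sidecar error or timeout."""

def score_or_fail(call_sidecar, fail_closed: bool = False) -> float:
    """Apply the documented fail-open / fail-closed behavior.

    Fail open (default): a sidecar failure scores the request as clean.
    Fail closed (pi_fail_closed: true): the failure rejects the request.
    """
    try:
        return call_sidecar()
    except SidecarError:
        if fail_closed:
            raise  # reject the request when scanning is degraded
        return 0.0  # proceed as if no detection occurred

def broken_sidecar():
    raise SidecarError("timeout")
```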
Can I tune the thresholds per project or per customer?
PI protection is an org-level setting in the current release. Use Projects and customer-scoped blocklist rules to apply different routing decisions to different traffic, but the classifier thresholds themselves are org-wide.
How is PI protection different from DLP?
DLP scans for structured sensitive data like SSNs, credit cards, and API keys. PI protection scans for adversarial intent - prompts trying to override system instructions, leak training data, or exfiltrate prior conversation. Run both for full coverage.
Where is the detection score stored?
Every request gets a `pi_score` field on its log entry, regardless of mode - even `off`. That means you can backfill an analysis over traffic that was scored while enforcement was disabled; switch to `alert` and detections start generating alerts on top of the scores you already have.
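A backfill analysis is then just a filter over log entries; only the `pi_score` field name comes from this page, and the log shape and threshold value below are assumptions:

```python
# Hypothetical log entries; only the pi_score field name is documented.
log = [
    {"request_id": "r1", "pi_score": 0.05},
    {"request_id": "r2", "pi_score": 0.92},
    {"request_id": "r3", "pi_score": 0.61},
]

block_threshold = 0.85  # illustrative value, not a documented default

would_have_blocked = [e["request_id"] for e in log if e["pi_score"] >= block_threshold]
```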