Prompt injection protection

Detect and block prompt-injection attempts before they reach the model.

Prompt injection is one of the most common attacks against LLM-backed products. Gateway scores every inbound prompt and outbound completion with a tuned classifier, surfaces detections in the dashboard, and - when configured to enforce - blocks or redacts requests at or above your block threshold.


Modes

PI protection has three modes, set per organization:

  • off - The classifier is bypassed. No scoring, no alerts, no enforcement.
  • alert - The classifier runs on every request and emits alerts for detections, but never blocks. Use this to measure detection rate before enforcing.
  • block - Detections at or above the block threshold are enforced - the request is blocked, redacted, rerouted, or escalated based on your action configuration.

Roll out PI protection in alert mode first. Watch the Alerts tab for a week, tune your thresholds and allowlist, then flip to block.


How detection works

Gateway scores prompts using a fine-tuned DeBERTa v3 classifier hosted as a sidecar to the data plane. Each request gets a score from 0.0 (clean) to 1.0 (almost certainly injection). The score is compared against two thresholds:

  • pi_block_threshold (default 0.57) - At or above this score, the request triggers the configured input or output action.
  • pi_pass_threshold (default 0.30) - At or below this score, the request is treated as clean.

Scores between the two thresholds are recorded but do not trigger enforcement. The defaults were calibrated against the deepset and jackhhao prompt-injection datasets and an internal customer dataset to hit a 1% false-positive rate.

pi_pass_threshold must always be less than or equal to pi_block_threshold.
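
A minimal sketch of how the two thresholds partition scores, using the defaults above (the function and outcome labels are illustrative, not Gateway internals):

def score_outcome(score: float,
                  pass_threshold: float = 0.30,
                  block_threshold: float = 0.57) -> str:
    """Map a classifier score to one of the three outcomes described above."""
    if score >= block_threshold:
        return "enforce"   # triggers the configured input or output action
    if score <= pass_threshold:
        return "clean"     # treated as clean
    return "recorded"      # logged, but no enforcement

score_outcome(0.72)  # 'enforce'
score_outcome(0.45)  # 'recorded'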


Actions on detection

Two action settings control what happens when a detection crosses the block threshold:

  • pi_input_action - applied when the inbound prompt is flagged. Default: block.
  • pi_output_action - applied when the model’s response is flagged. Default: redact.

Each accepts one of five values:

  • observe - Record the detection only. No change to the request or response.
  • redact - Replace the flagged span with a placeholder before sending it onward.
  • route - Reroute the request to a “safer” vendor configured via pi_safer_vendor_route.
  • block - Reject the request with a 4xx error.
  • escalate - Block the request and mark the alert as escalated for human review.
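
As a rough sketch of how the two settings and the redact action fit together (the setting names match the list above; the helper functions are hypothetical):

settings = {
    "pi_input_action": "block",    # default for flagged inbound prompts
    "pi_output_action": "redact",  # default for flagged completions
}

def select_action(direction: str) -> str:
    """Pick the configured action for a detection on the prompt ('input') or the completion ('output')."""
    return settings["pi_input_action"] if direction == "input" else settings["pi_output_action"]

def redact_span(text: str, start: int, end: int, placeholder: str = "[REDACTED]") -> str:
    """What redact does conceptually: replace the flagged span with a placeholder before sending onward."""
    return text[:start] + placeholder + text[end:]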

Allowlist patterns

Some legitimate workflows look like injection - for example, security researchers querying a model about jailbreak techniques, or a customer-support tool that summarizes spammy emails. To suppress those false positives, configure an allowlist of regex patterns. Any prompt that matches at least one pattern bypasses the classifier.

Limits:

  • Up to 50 patterns per org.
  • Each pattern can be up to 200 characters.
  • Patterns must compile as valid Python regex (re.compile). Invalid patterns are rejected on save.

Allowlist matches are also logged so you can audit what’s being suppressed.
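
As a sketch of those rules, assuming plain Python regex semantics (the function names are illustrative, and whether matching uses search or full-match is an assumption):

import re

MAX_PATTERNS = 50         # up to 50 patterns per org
MAX_PATTERN_LENGTH = 200  # each pattern up to 200 characters

def compile_allowlist(patterns: list[str]) -> list[re.Pattern]:
    """Reject configurations that exceed the limits or fail to compile, mirroring the save-time checks."""
    if len(patterns) > MAX_PATTERNS:
        raise ValueError("at most 50 allowlist patterns per org")
    compiled = []
    for pattern in patterns:
        if len(pattern) > MAX_PATTERN_LENGTH:
            raise ValueError(f"pattern exceeds 200 characters: {pattern!r}")
        compiled.append(re.compile(pattern))  # re.error here means the pattern is rejected on save
    return compiled

def bypasses_classifier(prompt: str, compiled: list[re.Pattern]) -> bool:
    """A prompt matching at least one allowlist pattern skips scoring entirely."""
    return any(p.search(prompt) for p in compiled)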


Configuring PI protection in the dashboard

Open Security → Prompt Injection in the Merge Gateway dashboard and configure:

  • Mode (off / alert / block)
  • Block and pass thresholds
  • Input and output actions
  • Allowlist patterns
  • A “safer” vendor for the route action, if used

Only org admins with the MANAGE_ORG_SETTINGS permission can change PI settings. Every change is written to the audit log.


Security alerts

In alert and block modes, every detection generates a security alert visible under Security → Alerts in the dashboard. Each alert includes:

  • Timestamp and request ID
  • Mode at the time of the event (alert / block)
  • Classifier score and the configured thresholds
  • The action that was applied (observe, redact, route, block, escalate)
  • The triggering segment of the prompt or completion

Alerts are indexed in the same store as the request log, so you can filter by customer, project, or API key when investigating a spike.

The full triggering text is stored on the alert by default. If you handle sensitive prompts, set pi_log_full_text_on_block: false to redact the text from the alert record while keeping the metadata.


What a blocked request looks like

When PI protection blocks an inbound prompt, Gateway returns HTTP 400 with a stable signal you can match on:

{
  "error": {
    "type": "invalid_request_error",
    "message": "Request blocked: prompt injection detected.",
    "code": "pi_blocked"
  }
}

Output blocks return the same shape with a pi_output_blocked code. Match on the code field - pi_blocked / pi_output_blocked - rather than the message text; the codes are stable across releases.
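
For example, a client can branch on that code. A minimal sketch using the Python requests library - the endpoint URL, model name, and payload are placeholders, not a documented Gateway route:

import requests

resp = requests.post(
    "https://gateway.example.com/v1/chat/completions",   # placeholder URL
    headers={"Authorization": "Bearer <gateway-api-key>"},
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "..."}]},
)

if resp.status_code == 400:
    code = resp.json().get("error", {}).get("code")
    if code in ("pi_blocked", "pi_output_blocked"):
        # Retrying won't help; surface a friendly message to the end user instead.
        print("Request rejected by prompt-injection protection")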


FAQ

How much latency does PI protection add?
The classifier sidecar adds a few milliseconds per request - negligible compared to LLM inference. The classifier runs in parallel with policy resolution where possible.

What happens if the classifier sidecar fails or times out?
By default, Gateway fails open: if the sidecar errors or times out, the request proceeds as if no detection occurred. Set pi_fail_closed: true to reject the request instead - appropriate for high-stakes workloads where any inference is unacceptable when scanning is degraded.
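
A conceptual sketch of that fail-open / fail-closed choice, assuming a scoring call that can raise on error or timeout (everything here other than the pi_fail_closed setting is hypothetical):

def score_or_fallback(prompt: str, score_fn, fail_closed: bool = False):
    """Fail open by default: if scoring errors out, return no score and let the request proceed."""
    try:
        return score_fn(prompt)
    except Exception:
        if fail_closed:    # pi_fail_closed: true
            raise          # the request is rejected instead of proceeding unscanned
        return None        # fail open: proceed as if no detection occurred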

Can I set different thresholds per project or customer?
PI protection is an org-level setting in the current release. Use Projects and customer-scoped blocklist rules to apply different routing decisions to different traffic, but the classifier thresholds themselves are org-wide.

How is PI protection different from DLP?
DLP scans for structured sensitive data like SSNs, credit cards, and API keys. PI protection scans for adversarial intent - prompts trying to override system instructions, leak training data, or exfiltrate prior conversation. Run both for full coverage.

Are classifier scores logged even when nothing is blocked?
Yes - in alert and block modes, every request gets a pi_score field on its log entry whether or not anything is enforced. Scores are not recorded in off mode, so if you want a baseline before enforcing, switch to alert: scores start populating immediately without blocking anything.
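
Once scores are populating, you can estimate what block mode would have caught. This sketch assumes a JSON-lines export of the request log with a pi_score field per entry - the export format is an assumption for illustration, not documented behavior:

import json

BLOCK_THRESHOLD = 0.57  # your configured pi_block_threshold

would_block = 0
total = 0
with open("request_log_export.jsonl") as f:   # hypothetical export file
    for line in f:
        entry = json.loads(line)
        score = entry.get("pi_score")
        if score is None:
            continue
        total += 1
        if score >= BLOCK_THRESHOLD:
            would_block += 1

print(f"{would_block}/{total} scored requests would have been blocked")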


Next steps