Prompt injection protection

Detect and block prompt-injection attempts before they reach the model.

Prompt injection is one of the most common attacks against LLM-backed products. Gateway scores every inbound prompt and outbound completion with a tuned classifier, surfaces detections in the dashboard, and - when configured to enforce - blocks or redacts requests at or above your block threshold.


Modes

PI protection has three modes, set per organization:

  • off - The classifier is bypassed. No scoring, no alerts, no enforcement.
  • alert - The classifier runs on every request and emits alerts for detections, but never blocks. Use this to measure detection rate before enforcing.
  • block - Detections at or above the block threshold are enforced - the request is blocked, redacted, rerouted, or escalated based on your action configuration.

Roll out PI protection in alert mode first. Watch the Alerts tab for a week, tune your thresholds and allowlist, then flip to block.


How detection works

Gateway scores prompts using a fine-tuned DeBERTa v3 classifier hosted as a sidecar to the data plane. Each request gets a score from 0.0 (clean) to 1.0 (almost certainly injection). The score is compared against two thresholds:

  • pi_block_threshold (default 0.57) - At or above this score, the request triggers the configured input or output action.
  • pi_pass_threshold (default 0.30) - At or below this score, the request is treated as clean.

Scores between the two thresholds are recorded but do not trigger enforcement. The defaults were calibrated against the deepset and jackhhao prompt-injection datasets and an internal customer dataset to hit a 1% false-positive rate.

pi_pass_threshold must always be less than or equal to pi_block_threshold.
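
A minimal sketch of how the two thresholds partition scores, using the defaults above (the function and outcome labels are illustrative, not Gateway internals):

def score_outcome(score: float,
                  pass_threshold: float = 0.30,
                  block_threshold: float = 0.57) -> str:
    """Map a classifier score to one of the three outcomes described above."""
    if score >= block_threshold:
        return "enforce"   # triggers the configured input or output action
    if score <= pass_threshold:
        return "clean"     # treated as clean
    return "recorded"      # logged, but no enforcement

score_outcome(0.72)  # 'enforce'
score_outcome(0.45)  # 'recorded'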


Actions on detection

Two action settings control what happens when a detection crosses the block threshold:

  • pi_input_action - applied when the inbound prompt is flagged. Default: block.
  • pi_output_action - applied when the model’s response is flagged. Default: redact.

Each accepts one of five values:

  • observe - Record the detection only. No change to the request or response.
  • redact - Replace the flagged span with a placeholder before sending it onward.
  • route - Reroute the request to a “safer” vendor configured via pi_safer_vendor_route.
  • block - Reject the request with a 4xx error.
  • escalate - Block the request and mark the alert as escalated for human review.
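
As a rough sketch of how the two settings and the redact action fit together (the setting names match the list above; the helper functions are hypothetical):

settings = {
    "pi_input_action": "block",    # default for flagged inbound prompts
    "pi_output_action": "redact",  # default for flagged completions
}

def select_action(direction: str) -> str:
    """Pick the configured action for a detection on the prompt ('input') or the completion ('output')."""
    return settings["pi_input_action"] if direction == "input" else settings["pi_output_action"]

def redact_span(text: str, start: int, end: int, placeholder: str = "[REDACTED]") -> str:
    """What redact does conceptually: replace the flagged span with a placeholder before sending onward."""
    return text[:start] + placeholder + text[end:]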

Allowlist patterns

Some legitimate workflows look like injection - for example, security researchers querying a model about jailbreak techniques, or a customer-support tool that summarizes spammy emails. To suppress those false positives, configure an allowlist of regex patterns. Any prompt that matches at least one pattern bypasses the classifier.

Limits:

  • Up to 50 patterns per org.
  • Each pattern can be up to 200 characters.
  • Patterns must compile as valid Python regex (re.compile). Invalid patterns are rejected on save.

Allowlist matches are also logged so you can audit what’s being suppressed.
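
As a sketch of those rules, assuming plain Python regex semantics (the function names are illustrative, and whether matching uses search or full-match is an assumption):

import re

MAX_PATTERNS = 50         # up to 50 patterns per org
MAX_PATTERN_LENGTH = 200  # each pattern up to 200 characters

def compile_allowlist(patterns: list[str]) -> list[re.Pattern]:
    """Reject configurations that exceed the limits or fail to compile, mirroring the save-time checks."""
    if len(patterns) > MAX_PATTERNS:
        raise ValueError("at most 50 allowlist patterns per org")
    compiled = []
    for pattern in patterns:
        if len(pattern) > MAX_PATTERN_LENGTH:
            raise ValueError(f"pattern exceeds 200 characters: {pattern!r}")
        compiled.append(re.compile(pattern))  # re.error here means the pattern is rejected on save
    return compiled

def bypasses_classifier(prompt: str, compiled: list[re.Pattern]) -> bool:
    """A prompt matching at least one allowlist pattern skips scoring entirely."""
    return any(p.search(prompt) for p in compiled)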


Configuring PI protection in the dashboard

Open Security → Prompt Injection in the Merge Gateway dashboard and configure:

  • Mode (off / alert / block)
  • Block and pass thresholds
  • Input and output actions
  • Allowlist patterns
  • A “safer” vendor for the route action, if used

Only org admins with the MANAGE_ORG_SETTINGS permission can change PI settings. Every change is written to the audit log.


Security alerts

In alert and block modes, every detection generates a security alert visible under Security → Alerts in the dashboard. Each alert includes:

  • Timestamp and request ID
  • Mode at the time of the event (alert / block)
  • Classifier score and the configured thresholds
  • The action that was applied (observe, redact, route, block, escalate)
  • The triggering segment of the prompt or completion

Alerts are indexed in the same store as the request log, so you can filter by customer, project, or API key when investigating a spike.

The full triggering text is stored on the alert by default. If you handle sensitive prompts, set pi_log_full_text_on_block: false to redact the text from the alert record while keeping the metadata.


What a blocked request looks like

When PI protection blocks an inbound prompt, Gateway returns HTTP 400 with a stable signal you can match on:

{
  "error": {
    "type": "invalid_request_error",
    "message": "Request blocked: prompt injection detected.",
    "code": "pi_blocked"
  }
}

Output blocks return the same shape with a pi_output_blocked code. Match on the code field - pi_blocked / pi_output_blocked - rather than the message text; the codes are stable across releases.
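
For example, a client can branch on that code. A minimal sketch using the Python requests library - the endpoint URL, model name, and payload are placeholders, not a documented Gateway route:

import requests

resp = requests.post(
    "https://gateway.example.com/v1/chat/completions",   # placeholder URL
    headers={"Authorization": "Bearer <gateway-api-key>"},
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "..."}]},
)

if resp.status_code == 400:
    code = resp.json().get("error", {}).get("code")
    if code in ("pi_blocked", "pi_output_blocked"):
        # Retrying won't help; surface a friendly message to the end user instead.
        print("Request rejected by prompt-injection protection")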


FAQ

How much latency does PI protection add?
The classifier sidecar adds a few milliseconds per request - negligible compared to LLM inference. The classifier runs in parallel with policy resolution where possible.

What happens if the classifier sidecar fails or times out?
By default, Gateway fails open: if the sidecar errors or times out, the request proceeds as if no detection occurred. Set pi_fail_closed: true to reject the request instead - appropriate for high-stakes workloads where any inference is unacceptable when scanning is degraded.
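
A conceptual sketch of that fail-open / fail-closed choice, assuming a scoring call that can raise on error or timeout (everything here other than the pi_fail_closed setting is hypothetical):

def score_or_fallback(prompt: str, score_fn, fail_closed: bool = False):
    """Fail open by default: if scoring errors out, return no score and let the request proceed."""
    try:
        return score_fn(prompt)
    except Exception:
        if fail_closed:    # pi_fail_closed: true
            raise          # the request is rejected instead of proceeding unscanned
        return None        # fail open: proceed as if no detection occurred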

Can I set different thresholds per project or customer?
PI protection is an org-level setting in the current release. Use Projects and customer-scoped blocklist rules to apply different routing decisions to different traffic, but the classifier thresholds themselves are org-wide.

How is PI protection different from DLP?
DLP scans for structured sensitive data like SSNs, credit cards, and API keys. PI protection scans for adversarial intent - prompts trying to override system instructions, leak training data, or exfiltrate prior conversation. Run both for full coverage.

Are classifier scores logged even when nothing is blocked?
Yes - in alert and block modes, every request gets a pi_score field on its log entry whether or not anything is enforced. Scores are not recorded in off mode, so if you want a baseline before enforcing, switch to alert: scores start populating immediately without blocking anything.
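
Once scores are populating, you can estimate what block mode would have caught. This sketch assumes a JSON-lines export of the request log with a pi_score field per entry - the export format is an assumption for illustration, not documented behavior:

import json

BLOCK_THRESHOLD = 0.57  # your configured pi_block_threshold

would_block = 0
total = 0
with open("request_log_export.jsonl") as f:   # hypothetical export file
    for line in f:
        entry = json.loads(line)
        score = entry.get("pi_score")
        if score is None:
            continue
        total += 1
        if score >= BLOCK_THRESHOLD:
            would_block += 1

print(f"{would_block}/{total} scored requests would have been blocked")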


Next steps