Prompt injection protection

Detect and block prompt-injection attempts before they reach the model

Prompt injection is one of the most common attacks against LLM-backed products. Gateway scores every inbound prompt and outbound completion with a tuned classifier, surfaces detections in the dashboard, and, when configured to enforce, applies your configured action (such as block or redact) to requests that score at or above your block threshold.


Modes

PI protection has three modes, set per organization:

| Mode | Behavior |
| --- | --- |
| off | The classifier is bypassed. No scoring, no alerts, no enforcement. |
| alert | The classifier runs on every request and emits alerts for detections, but never blocks. Use this to measure detection rate before enforcing. |
| block | Detections at or above the block threshold are enforced. The request is blocked, redacted, rerouted, or escalated based on your action configuration. |

Roll out PI protection in alert mode first. Watch the Alerts tab for a week, tune your thresholds and allowlist, then flip to block.


How detection works

Gateway scores prompts using a fine-tuned DeBERTa v3 classifier hosted as a sidecar to the data plane. Each request gets a score from 0.0 (clean) to 1.0 (almost certainly injection). The score is compared against two thresholds:

| Threshold | Default | Meaning |
| --- | --- | --- |
| pi_block_threshold | 0.57 | At or above this score, the request triggers the configured input or output action |
| pi_pass_threshold | 0.30 | At or below this score, the request is treated as clean |

Scores between the two thresholds are recorded but do not trigger enforcement. The defaults were calibrated against the deepset and jackhhao datasets and an internal customer dataset to hit a 1% false-positive rate.

pi_pass_threshold must always be less than or equal to pi_block_threshold.
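
The threshold comparison above can be sketched as a small function. This is an illustration only, not Gateway's implementation: the `PiThresholds` class, its field names, and the band labels are hypothetical, and only the default values and the ordering rule come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiThresholds:
    # Hypothetical container; defaults mirror the documented values.
    block: float = 0.57    # pi_block_threshold
    passing: float = 0.30  # pi_pass_threshold

    def __post_init__(self) -> None:
        # The pass threshold must not exceed the block threshold.
        if not (0.0 <= self.passing <= self.block <= 1.0):
            raise ValueError(
                "need 0.0 <= pass threshold <= block threshold <= 1.0"
            )


def score_band(score: float, thresholds: PiThresholds = PiThresholds()) -> str:
    """Classify a score as 'detection', 'clean', or 'recorded'."""
    if score >= thresholds.block:
        return "detection"  # triggers the configured action
    if score <= thresholds.passing:
        return "clean"
    return "recorded"       # logged, but no enforcement
```

If both thresholds are set to the same value, this sketch checks the block threshold first, so a score exactly at that value counts as a detection. That tie-break is an assumption, not documented behavior.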


Actions on detection

Two action settings control what happens when a detection crosses the block threshold:

  • pi_input_action: applied when the inbound prompt is flagged. Default: block.
  • pi_output_action: applied when the model’s response is flagged. Default: redact.

Each accepts one of five values:

| Action | Effect |
| --- | --- |
| observe | Record the detection only. No change to the request or response. |
| redact | Replace the flagged span with a placeholder before sending it onward |
| route | Reroute the request to a “safer” vendor configured via pi_safer_vendor_route |
| block | Reject the request with a 4xx error |
| escalate | Block the request and mark the alert as escalated for human review |
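
To show how mode, score, and the two action settings might combine, here is a minimal sketch. The function `resolve_action` is hypothetical and is not part of any Gateway SDK. It assumes that alert mode treats a score at or above the block threshold as a detection and applies observe-only handling, which this page implies but does not state outright.

```python
from typing import Optional

MODES = {"off", "alert", "block"}
ACTIONS = {"observe", "redact", "route", "block", "escalate"}


def resolve_action(
    mode: str,
    score: float,
    direction: str,                 # "input" or "output"
    block_threshold: float = 0.57,  # documented default
    input_action: str = "block",    # documented default for pi_input_action
    output_action: str = "redact",  # documented default for pi_output_action
) -> Optional[str]:
    """Return the action to apply, or None when nothing is detected."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if direction not in {"input", "output"}:
        raise ValueError(f"unknown direction: {direction!r}")
    action = input_action if direction == "input" else output_action
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")

    if mode == "off" or score < block_threshold:
        return None       # classifier bypassed, or score below the threshold
    if mode == "alert":
        return "observe"  # assumption: alert mode records but never enforces
    return action         # block mode: enforce the configured action
```

With the defaults, `resolve_action("block", 0.9, "input")` yields `"block"` and `resolve_action("block", 0.9, "output")` yields `"redact"`.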

Allowlist patterns

Some legitimate workflows look like injection: for example, security researchers querying a model about jailbreak techniques, or a customer-support tool that summarizes spammy emails. To suppress those false positives, configure an allowlist of regex patterns. Any prompt that matches at least one pattern bypasses the classifier.

Limits:

  • Up to 50 patterns per org
  • Each pattern can be up to 200 characters
  • Patterns must compile as valid Python regex (re.compile). Invalid patterns are rejected on save.

Allowlist matches are also logged so you can audit what’s being suppressed.
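
Before saving patterns in the dashboard, it can help to check them locally against the limits above. The following is a rough sketch: `compile_allowlist` and `is_allowlisted` are hypothetical helpers, and the use of `re.search` (a match anywhere in the prompt) is an assumption, since this page does not say whether Gateway uses search or full-match semantics.

```python
import re

MAX_PATTERNS = 50          # documented per-org limit
MAX_PATTERN_LENGTH = 200   # documented per-pattern limit


def compile_allowlist(patterns: list[str]) -> list[re.Pattern[str]]:
    """Validate patterns against the documented limits and compile them."""
    if len(patterns) > MAX_PATTERNS:
        raise ValueError(f"at most {MAX_PATTERNS} patterns are allowed")
    compiled = []
    for pattern in patterns:
        if len(pattern) > MAX_PATTERN_LENGTH:
            raise ValueError(f"pattern exceeds {MAX_PATTERN_LENGTH} characters")
        try:
            compiled.append(re.compile(pattern))
        except re.error as exc:
            raise ValueError(f"invalid regex {pattern!r}: {exc}") from exc
    return compiled


def is_allowlisted(prompt: str, allowlist: list[re.Pattern[str]]) -> bool:
    """True if at least one pattern matches anywhere in the prompt."""
    # Assumption: search semantics, not fullmatch.
    return any(pattern.search(prompt) for pattern in allowlist)


allowlist = compile_allowlist([r"(?i)\bjailbreak research\b"])
print(is_allowlisted("Notes on Jailbreak Research methods", allowlist))
```

Keep patterns narrow. A broad pattern such as `.*` would match every prompt and, under the bypass rule above, skip the classifier entirely.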


Configuring PI protection in the dashboard

Open Security → Prompt Injection in the Merge Gateway dashboard and configure:

  • Mode (off / alert / block)
  • Block and pass thresholds
  • Input and output actions
  • Allowlist patterns
  • A “safer” vendor for the route action, if used

Only org members with the Manage security rules permission can change PI settings. Every change is written to the audit log.


Security alerts

In alert and block modes, every detection generates a security alert visible under Security → Alerts in the dashboard. Each alert includes:

  • Timestamp and request ID
  • Mode at the time of the event (alert / block)
  • Classifier score and the configured thresholds
  • The action that was applied (observe, redact, route, block, escalate)
  • The triggering segment of the prompt or completion

Alerts are indexed in the same store as the request log, so you can filter by customer, project, or API key when investigating a spike.

The full triggering text is stored on the alert by default. If you handle sensitive prompts, set pi_log_full_text_on_block: false to redact the text from the alert record while keeping the metadata.


What a blocked request looks like

When PI protection blocks an inbound prompt, Gateway returns HTTP 400 with a stable signal you can match on:

    {
      "error": {
        "type": "invalid_request_error",
        "message": "Request blocked: prompt injection detected.",
        "code": "pi_blocked"
      }
    }

Output blocks return the same shape with a pi_output_blocked code. The stable detection signal is the code field (pi_blocked or pi_output_blocked), which doesn’t change across releases.
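
On the client side, matching on the code field might look like the sketch below. The helper name `pi_block_code` is hypothetical. It takes an already-parsed JSON error body and ignores the message text, since only the code is described as stable.

```python
from typing import Optional

# The two codes documented on this page.
PI_BLOCK_CODES = {"pi_blocked", "pi_output_blocked"}


def pi_block_code(body: object) -> Optional[str]:
    """Return the PI block code from a parsed error body, or None."""
    if not isinstance(body, dict):
        return None
    error = body.get("error")
    if not isinstance(error, dict):
        return None
    code = error.get("code")
    if isinstance(code, str) and code in PI_BLOCK_CODES:
        return code
    return None


blocked_body = {
    "error": {
        "type": "invalid_request_error",
        "message": "Request blocked: prompt injection detected.",
        "code": "pi_blocked",
    }
}
print(pi_block_code(blocked_body))
```

A caller could use the returned code to show a friendly message or skip retries, because resending the same prompt would presumably be blocked again.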


FAQ

Does PI protection add latency?

The classifier sidecar adds a few milliseconds per request, negligible compared to LLM inference. The classifier runs in parallel with policy resolution where possible.

What happens when the classifier sidecar is unreachable?

By default, Gateway fails open: if the sidecar errors or times out, the request proceeds as if no detection occurred. Set pi_fail_closed: true to reject the request instead. This is appropriate for high-stakes workloads where running inference without a working scan is unacceptable.
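
The difference between the two failure policies can be illustrated with a small sketch. This is not Gateway's code: `score_or_fail`, `ScanUnavailable`, and the `scorer` callable are hypothetical stand-ins for the sidecar call, and only the `fail_closed` flag mirrors the documented pi_fail_closed setting.

```python
from typing import Callable, Optional


class ScanUnavailable(Exception):
    """Raised when scanning fails and the policy is fail-closed."""


def score_or_fail(
    prompt: str,
    scorer: Callable[[str], float],  # hypothetical stand-in for the sidecar
    fail_closed: bool = False,       # mirrors pi_fail_closed
) -> Optional[float]:
    """Return the score, or None when failing open on a scanner error."""
    try:
        return scorer(prompt)
    except Exception as exc:
        if fail_closed:
            # Fail closed: reject rather than run inference unscanned.
            raise ScanUnavailable("classifier unavailable") from exc
        # Fail open: proceed as if no detection occurred.
        return None
```

Failing open favors availability, and failing closed favors safety. The right choice depends on how costly an unscanned request is for the workload.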

Can I tune the thresholds per project or per customer?

PI protection is an org-level setting in the current release. Use Projects and customer-scoped blocklist rules to apply different routing decisions to different traffic, but the classifier thresholds themselves are org-wide.

How is PI protection different from DLP?

DLP scans for structured sensitive data like SSNs, credit cards, and API keys. PI protection scans for adversarial intent: prompts trying to override system instructions, leak training data, or exfiltrate prior conversation. Run both for full coverage.

Where is the detection score stored?

Every request that the classifier scores gets a pi_score field on its log entry, whether or not the score triggered an alert or enforcement. In off mode the classifier is bypassed, so no score is recorded for that traffic. Switch to alert mode and the score starts populating.


Next steps

  • Data loss prevention: Scan prompts and completions for PII, secrets, and other sensitive data
  • Customer blocklist: Block or pin per-customer combinations of providers and models
  • Zero data retention: Restrict routing to vendors with zero data retention agreements