Multimodal

Send images and documents alongside text in your LLM requests.

Gateway accepts multimodal content natively — include image or document content blocks in your messages and Gateway routes to a capable model. No configuration needed. Gateway automatically detects which models support each modality and translates content to the provider’s format.

Supported content types

| Type | Content Blocks | Source Types | Example Models |
| --- | --- | --- | --- |
| Images | image, image_url | base64, URL | GPT-5.1, Claude Sonnet 4, Gemini 2.0 Flash |
| Documents | document | base64, URL | Claude Sonnet 4, Gemini 2.0 Flash |
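For example, the two block shapes above can be built as plain dictionaries. This is an illustrative sketch: the type values and source types come from the table, but the exact field names for base64 data and media type are assumptions.

```python
import base64

# A tiny placeholder payload stands in for real image bytes.
image_bytes = b"\x89PNG\r\n\x1a\n"

# base64-sourced image block (field names beyond "type" are assumptions).
image_block = {
    "type": "image",
    "data": base64.b64encode(image_bytes).decode("ascii"),
    "media_type": "image/png",
}

# URL-sourced document block.
document_block = {
    "type": "document",
    "url": "https://example.com/report.pdf",
}

# Blocks are combined with text inside a single user message.
message = {
    "type": "message",
    "role": "user",
    "content": [
        {"type": "text", "text": "Summarize the attached document."},
        document_block,
    ],
}
```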

Quick example

from merge_gateway import MergeGateway

client = MergeGateway(api_key="YOUR_API_KEY")

response = client.responses.create(
    model="openai/gpt-5.1",
    input=[
        {
            "type": "message",
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "url": "https://example.com/photo.jpg"},
            ],
        }
    ],
)

print(response.output[0].content[0].text)

Model compatibility

Gateway auto-detects multimodal capabilities from model metadata. To check a specific model, call GET /v1/models and inspect its capabilities.input array and capabilities.vision field.
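A minimal capability check might look like the following. The capabilities.input and capabilities.vision fields are the ones named above, but the surrounding JSON layout of the /v1/models response is an assumption for illustration.

```python
def supports_images(model_entry: dict) -> bool:
    """Return True if a /v1/models entry advertises image input.

    Checks both capability fields mentioned above; the exact
    response layout is assumed for this sketch.
    """
    caps = model_entry.get("capabilities", {})
    return "image" in caps.get("input", []) or caps.get("vision", False)


# Sample entries shaped like a GET /v1/models response (illustrative only).
models = [
    {"id": "openai/gpt-5.1",
     "capabilities": {"input": ["text", "image"], "vision": True}},
    {"id": "example/text-only-model",
     "capabilities": {"input": ["text"], "vision": False}},
]

vision_models = [m["id"] for m in models if supports_images(m)]
```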

| Provider | Images | Documents |
| --- | --- | --- |
| OpenAI | GPT-5.1, GPT-4o | |
| Anthropic | Claude Sonnet 4, Claude Haiku 3.5 | Claude Sonnet 4, Claude Haiku 3.5 |
| Google | Gemini 2.0 Flash, Gemini 2.5 Pro | Gemini 2.0 Flash, Gemini 2.5 Pro |
| Bedrock | Varies by model | Varies by model |

Context compression automatically protects multimodal messages. When trimming is needed, text-only messages are removed first — your images and documents are preserved.
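The trimming priority can be sketched as follows. This is an illustrative model of the behavior described above, not Gateway's actual algorithm: real compression works on token budgets, while this sketch trims by message count.

```python
def compress_context(messages: list[dict], max_messages: int) -> list[dict]:
    """Drop the oldest text-only messages first until the history fits.

    Illustrative sketch of the stated trimming priority: messages
    containing image or document blocks are never removed.
    """
    def is_text_only(msg: dict) -> bool:
        return all(block["type"] == "text" for block in msg["content"])

    messages = list(messages)
    while len(messages) > max_messages:
        # Find the oldest text-only message and remove it first.
        idx = next((i for i, m in enumerate(messages) if is_text_only(m)), None)
        if idx is None:
            break  # only multimodal messages remain; preserve them all
        messages.pop(idx)
    return messages


history = [
    {"role": "user", "content": [{"type": "text", "text": "a"}]},
    {"role": "user", "content": [
        {"type": "text", "text": "b"},
        {"type": "image_url", "url": "https://example.com/photo.jpg"},
    ]},
    {"role": "user", "content": [{"type": "text", "text": "c"}]},
]

trimmed = compress_context(history, max_messages=2)
```

With a budget of two, the oldest text-only message is dropped while the image-bearing message survives.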

Next steps