Multi-modal (image inputs)
Send images to your agents
Multi-modal features require compatible language models. Ensure your agent is configured with a multi-modal capable model.
Letta agents support image inputs, enabling richer conversations and more powerful agent capabilities.
Model Support
Multi-modal capabilities depend on the underlying language model. To confirm that a given model supports image inputs, check the provider's individual model pages. Examples of models with image support include:
- OpenAI: GPT-4.1, GPT-4o, o1/o3/o4
- Anthropic: Claude Opus 4, Claude Sonnet 4
- Gemini: Gemini 2.5 Pro, Gemini 2.5 Flash
If the model you’re using doesn’t support image inputs, the image will still appear in the agent's context window, but as a text message telling the agent that an image exists.
ADE Support
In the Agent Development Environment (ADE), you can pass images to your agents by dragging and dropping them into the chat window, or by clicking the image icon to upload a file manually.
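Outside the ADE, images are sent programmatically as part of a user message whose content mixes image and text parts. The sketch below shows one way to build such a payload in Python by base64-encoding the image bytes; the exact part schema (`type`, `source`, `media_type`, `data`) is an assumption modeled on common multi-modal message formats — check the Letta API reference for the fields your SDK version expects.

```python
import base64


def build_image_message(image_bytes: bytes, media_type: str, prompt: str) -> dict:
    """Build a user message combining an image part and a text part.

    NOTE: the field names used here are an assumption based on common
    multi-modal message schemas, not a guaranteed match for Letta's API.
    """
    # Base64-encode the raw image bytes so they can travel inside JSON.
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,  # e.g. "image/png", "image/jpeg"
                    "data": encoded,
                },
            },
            {"type": "text", "text": prompt},
        ],
    }


# Placeholder bytes stand in for a real image file read from disk.
message = build_image_message(b"\x89PNG...", "image/png", "What is in this image?")
```

A payload like this would then be passed to your agent's message-send endpoint (for example, a hypothetical `client.agents.messages.create(agent_id=..., messages=[message])` call); consult the Letta SDK documentation for the exact method name and signature.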

