Letta Leaderboards
Understand which models to use when building your agents
The Letta Leaderboards are open source and we actively encourage contributions! To learn how to add additional results or benchmarking tasks, read our contributor guide.
The Letta Leaderboards help developers select which language models to use in the Letta framework by reporting the performance of popular models on a series of tasks.
Letta is designed for building stateful agents - agents that are long-running and can automatically manage long-term memory to learn and adapt over time. To implement intelligent memory management, agents in Letta rely heavily on tool (function) calling, so models that excel at tool use tend to do well in Letta. Conversely, models that struggle to call tools properly often perform poorly when used to drive Letta agents.
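For context, here is a minimal sketch of creating a Letta agent with a specific model using the Python client (`letta_client`); the server URL, model handles, and memory block contents below are placeholder assumptions and may differ in your setup.

```python
from letta_client import Letta

# Connect to a Letta server (URL is a placeholder for a local deployment)
client = Letta(base_url="http://localhost:8283")

# Create a stateful agent backed by a model chosen from the leaderboard
agent = client.agents.create(
    model="openai/gpt-4.1",                     # model handle (placeholder; pick based on the results below)
    embedding="openai/text-embedding-3-small",  # embedding model used for archival memory retrieval
    memory_blocks=[
        {"label": "persona", "value": "I am a helpful assistant."},
        {"label": "human", "value": "The user's name is Sarah."},
    ],
)
print(agent.id)
```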
Letta Memory Leaderboard
The memory benchmark tests the ability of a model to understand a memory hierarchy and manage its own memory. Models that are strong at function calling and aware of their limitations (understanding in-context vs out-of-context data) typically excel here.
Overall Score refers to the combined core memory and archival memory score. Cost refers to the approximate cost (in USD) of running the benchmark. Open weights models prefixed with `together` were run on Together's API.
Benchmark breakdown →
Model recommendations →
Understanding the Benchmark
We measure two foundational aspects of context management: core memory and archival memory. Core memory is what lives inside the agent’s context window (aka “in-context memory”), and archival memory is context managed outside of it (aka “out-of-context memory”, or “external memory”).
For core memory, we measure the model’s ability to both read from and write to it. Good models should be able to use their core (in-context) memory to adapt their outputs and actions. For archival memory, we measure how well a model can retrieve relevant information - good models should understand when to access external information, and should be able to successfully execute one or more retrieval queries.
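To make the two tiers concrete, the snippet below shows the kinds of tool calls an agent issues during the benchmark. The tool names follow Letta's built-in MemGPT-style memory tools; the argument fields shown are simplified for illustration.

```python
# Core memory (in-context): the agent edits a memory block that lives
# directly inside its context window.
core_memory_edit = {
    "name": "core_memory_replace",
    "arguments": {
        "label": "human",
        "old_content": "Favorite color: unknown",
        "new_content": "Favorite color: teal",
    },
}

# Archival memory (out-of-context): the agent stores facts externally and
# must recognize when to issue retrieval queries against them.
archival_insert = {
    "name": "archival_memory_insert",
    "arguments": {"content": "Project Zephyr launched in March 2031."},
}
archival_search = {
    "name": "archival_memory_search",
    "arguments": {"query": "When did Project Zephyr launch?"},
}
```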
For all tasks in the Letta Memory Benchmark, we generate a fictional question-answering dataset with supporting facts to minimize prior knowledge from LLM training. To evaluate, we use a prompted GPT-4o judge to grade the agent-generated answer against the ground-truth answer, following SimpleQA. We also apply a penalty for extraneous memory operations, so models are docked for inefficient or incorrect archival memory accesses.
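As a rough sketch of how a single question could be scored under this scheme (the judge, penalty weight, and function names below are illustrative assumptions, not the benchmark's actual implementation):

```python
from typing import Callable

def score_task(
    agent_answer: str,
    ground_truth: str,
    extraneous_ops: int,
    judge: Callable[[str, str], float],
    penalty_per_op: float = 0.1,  # illustrative penalty weight
) -> float:
    """Grade one question with an LLM judge, then subtract a penalty for
    unnecessary memory operations."""
    correctness = judge(agent_answer, ground_truth)  # 1.0 correct, 0.0 incorrect
    return max(0.0, correctness - penalty_per_op * extraneous_ops)

# Example usage with a trivial stand-in judge
# (the real benchmark uses a prompted GPT-4o grader in the SimpleQA style).
print(score_task("March 2031", "March 2031", extraneous_ops=2, judge=lambda a, b: float(a == b)))
```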
For more details on the benchmark, refer to our blog post.
Results and Recommendations
For the closed model providers (OpenAI, Anthropic, Google):
- Gemini 2.5 Pro, Anthropic Claude 3.7 Sonnet, and OpenAI GPT-4.1 are the recommended models for most tasks
- Normalized for cost, Gemini 2.5 Flash is a top choice
- Models that perform well on the archival memory task (e.g. Claude 3.5 Haiku) may overuse memory operations when they are unnecessary, and thus receive a lower core memory score due to the extraneous access penalty
- The o-series reasoning models from OpenAI perform worse than GPT-4.1
For the open weights models (Llama, Qwen, Mistral, DeepSeek):
- Llama 3.1 405B is the best performing model overall
- Llama 4 Scout 17B and Qwen 2.5 72B perform similarly to GPT-4.1 mini