The Letta Leaderboard

Understand which models to use when building your agents

The Letta Leaderboard is open source and we actively encourage contributions! To learn how to add additional results or benchmarking tasks, read our contributor guide.

The Letta Leaderboard helps developers select which language models to use in the Letta framework by reporting the performance of popular models on a series of tasks.

Letta is designed for building stateful agents - agents that are long-running and can automatically manage long-term memory to learn and adapt over time. To implement intelligent memory management, agents in Letta rely heavily on tool (function) calling, so models that excel at tool use tend to do well in Letta. Conversely, models that struggle to call tools properly often perform poorly when used to drive Letta agents.
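To see where the model choice enters in practice, here is a minimal sketch of creating an agent with the letta_client Python SDK. The server URL, model and embedding handles, and memory block contents are illustrative, and the exact argument names may vary between SDK versions.

```python
# Minimal sketch (assumes the letta_client Python SDK and a locally running Letta server).
# The handles and memory block contents below are illustrative only.
from letta_client import Letta

client = Letta(base_url="http://localhost:8283")

# The leaderboard informs which model handle to pass here, e.g. a strong
# tool-calling model from the top of the table.
agent = client.agents.create(
    model="openai/gpt-4.1",
    embedding="openai/text-embedding-3-small",
    memory_blocks=[
        {"label": "human", "value": "Name: Ada. Prefers concise answers."},
        {"label": "persona", "value": "A helpful assistant that manages its own memory."},
    ],
)

# The agent decides on its own when to call its memory tools while responding.
response = client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Actually, my name is Grace."}],
)
print(response.messages)
```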

Memory Benchmarks

The memory benchmarks test the ability of a model to understand a memory hierarchy and manage its own memory. Models that are strong at function calling and aware of their limitations (understanding in-context vs out-of-context data) typically excel here.

Overall Score refers to the average score across the memory read, write, and update tasks. Cost refers to the approximate cost in USD to run the benchmark. Open weights models prefixed with "together" were run on Together's API.

Benchmark breakdown →
Model recommendations →

(Interactive leaderboard table with columns Model, Overall Score, and Cost. Try refreshing the page if the leaderboard data is not visible.)

Understanding the Benchmark

For a more in-depth breakdown of our memory benchmarks, read our blog.

We measure two foundational aspects of context management: core memory and archival memory. Core memory is what lives inside the agent’s context window (aka “in-context memory”), while archival memory is context stored outside of it (aka “out-of-context memory”, or “external memory”). The benchmark evaluates a stateful agent’s fundamental capabilities for reading, writing, and updating these memories.
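As a rough mental model (not Letta’s actual implementation), the two tiers can be sketched as an in-context store with limited space plus an external store that must be searched explicitly. The Python classes below are hypothetical stand-ins for illustration.

```python
# Illustrative sketch of the two memory tiers, not Letta's internal implementation.
# In Letta, agents manage these tiers through tools such as core_memory_append /
# core_memory_replace (in-context) and archival_memory_insert / archival_memory_search
# (out-of-context); the classes below are hypothetical stand-ins.
from dataclasses import dataclass, field


@dataclass
class CoreMemory:
    """In-context memory: always visible to the model, but limited in size."""
    limit: int = 2000
    blocks: dict[str, str] = field(default_factory=dict)

    def replace(self, label: str, old: str, new: str) -> None:
        updated = self.blocks.get(label, "").replace(old, new)
        if len(updated) > self.limit:
            raise ValueError("core memory block over limit; move details to archival")
        self.blocks[label] = updated


@dataclass
class ArchivalMemory:
    """Out-of-context memory: unbounded, but only visible after an explicit search."""
    passages: list[str] = field(default_factory=list)

    def insert(self, text: str) -> None:
        self.passages.append(text)

    def search(self, query: str) -> list[str]:
        # Letta uses embedding-based retrieval; substring match keeps the sketch simple.
        return [p for p in self.passages if query.lower() in p.lower()]
```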

For all the tasks in the memory benchmarks, we generate a fictional question-answering dataset with supporting facts to minimize prior knowledge from LLM training. To evaluate, we use a prompted GPT 4.1 to grade the agent-generated answer against the ground-truth answer, following SimpleQA. We also add a penalty for extraneous memory operations, penalizing models for inefficient or incorrect archival memory accesses.
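The grading step can be pictured roughly as follows. The judge prompt wording, the penalty weight, and the way extraneous operations are counted are hypothetical stand-ins, since the exact values are not given here; only the overall shape (an LLM judge plus a penalty term) follows the description above.

```python
# Rough sketch of SimpleQA-style LLM grading plus an extraneous-operation penalty.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Given the question, a gold answer, and a predicted answer,
reply with exactly one word: CORRECT, INCORRECT, or NOT_ATTEMPTED.

Question: {question}
Gold answer: {gold}
Predicted answer: {predicted}"""


def grade(question: str, gold: str, predicted: str) -> float:
    completion = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, gold=gold, predicted=predicted)}],
    )
    return 1.0 if completion.choices[0].message.content.strip() == "CORRECT" else 0.0


def score(question, gold, predicted, extraneous_memory_ops, penalty=0.1):
    # Penalize unnecessary archival reads/writes; the 0.1 weight is illustrative.
    return max(0.0, grade(question, gold, predicted) - penalty * extraneous_memory_ops)
```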

Main Results and Recommendations

For the closed model providers (OpenAI, Anthropic, Google):

  • Anthropic Claude Sonnet 4 and OpenAI GPT 4.1 are the recommended models for most tasks
  • Normalized for cost, Gemini 2.5 Flash and GPT 4o-mini are the top choices
  • Models that perform well on the archival memory task (e.g. Claude Haiku 3.5) might overuse memory operations when they are unnecessary, and thus receive a lower core memory score due to the extraneous-access penalty
  • The o-series reasoning models from OpenAI perform worse than GPT 4.1

For the open weights models (Llama, Qwen, Mistral, DeepSeek):

  • Llama 3.3 70B is the best-performing open weights model overall
  • DeepSeek v3 performs similarly to GPT 4.1-nano