Benchmark information

Benchmark datasets and evaluation criteria used in the Letta leaderboard.

We measure two foundational aspects of context management: core memory and archival memory. Core memory is what is inside the agent’s context window (aka “in-context memory”), and archival memory is context managed external to the agent (aka “out-of-context memory”, or “external memory”). This benchmark evaluates a stateful agent’s fundamental capabilities in reading, writing, and updating memories.
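As a rough sketch of the distinction, the snippet below models the two tiers: a small in-context “core” block that is rewritten in place, and an external “archival” store that is searched on demand. The class and method names are illustrative only, not Letta’s actual API.

```python
# Illustrative sketch of the two memory tiers (names are hypothetical, not Letta's API).


class AgentMemory:
    def __init__(self, core_limit_chars: int = 2000):
        self.core = {}            # in-context memory: small, always part of the prompt
        self.core_limit = core_limit_chars
        self.archival = []        # out-of-context memory: external store, searched on demand

    def core_memory_write(self, label: str, value: str) -> None:
        """Update a named block inside the context window (must stay within a size budget)."""
        self.core[label] = value
        assert sum(len(v) for v in self.core.values()) <= self.core_limit, "core memory overflow"

    def archival_insert(self, fact: str) -> None:
        """Write a fact to external storage; it does not consume context tokens."""
        self.archival.append(fact)

    def archival_search(self, query: str, k: int = 3) -> list[str]:
        """Retrieve facts by simple keyword overlap (a real agent would use embeddings)."""
        terms = set(query.lower().split())
        ranked = sorted(self.archival, key=lambda f: -len(terms & set(f.lower().split())))
        return ranked[:k]
```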

For all tasks in the Letta Memory Benchmark, we generate a fictional question-answering dataset with supporting facts to minimize prior knowledge from LLM training. To evaluate, we use a prompted GPT 4.1 judge to grade the agent-generated answer against the ground-truth answer, following SimpleQA. We also apply a penalty for extraneous memory operations, so that inefficient or incorrect archival memory accesses lower the score.
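A minimal sketch of how such a per-question score could be computed is shown below. The SimpleQA-style judge labels are real, but the penalty weight and the exact penalty form are assumptions here; the benchmark defines its own formula.

```python
# Hedged sketch of per-question scoring: LLM-judge grade minus a penalty for
# extraneous memory operations. PENALTY_PER_OP is an assumed placeholder weight.

PENALTY_PER_OP = 0.05  # hypothetical; the benchmark uses its own penalty


def grade_answer(agent_answer: str, ground_truth: str) -> str:
    """Stand-in for the prompted GPT 4.1 judge; returns a SimpleQA-style label."""
    # In the benchmark this is an LLM call; here we just compare normalized strings.
    return "CORRECT" if agent_answer.strip().lower() == ground_truth.strip().lower() else "INCORRECT"


def score_question(agent_answer: str, ground_truth: str, extraneous_ops: int) -> float:
    """Combine the judge grade with the extraneous-operation penalty."""
    base = 1.0 if grade_answer(agent_answer, ground_truth) == "CORRECT" else 0.0
    return max(0.0, base - PENALTY_PER_OP * extraneous_ops)


print(score_question("Paris", "paris", extraneous_ops=2))  # 0.9 under these assumptions
```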

For more details on the benchmark, refer to our blog post.

For the closed model providers (OpenAI, Anthropic, Google):

  • Anthropic Claude Sonnet 4 and OpenAI GPT 4.1 are the recommended models for most tasks
  • When normalized for cost, Gemini 2.5 Flash and GPT 4o-mini are the top choices
  • Models that perform well on the archival memory task (e.g. Claude Haiku 3.5) may overuse memory operations when they are unnecessary, and therefore receive a lower core memory score due to the extraneous-access penalty
  • The o-series reasoning models from OpenAI perform worse than GPT 4.1

For the open-weights models (Llama, Qwen, Mistral, DeepSeek):

  • Llama 3.1 405B is the best-performing model overall
  • Llama 4 Scout 17B and Qwen 2.5 72B perform similarly to GPT 4.1 Mini