Letta Evals
Systematic testing for stateful AI agents. Validate changes, prevent regressions, and ship with confidence.
Test agent memory, tool usage, multi-turn conversations, and state evolution with automated grading and pass/fail gates.
Ready to start? Jump to Getting Started or learn the Core Concepts first.
Core Concepts
Understand the building blocks of an evaluation; a minimal suite sketch follows this list:
- Suites - Configure your evaluation
- Datasets - Define test cases
- Targets - Specify the agent to test
- Graders - Score agent outputs
- Extractors - Extract content from responses
- Gates - Set pass/fail criteria
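To see how these pieces fit together, here is a minimal, hypothetical suite configuration. The field names and values shown are illustrative assumptions, not the actual schema; see the Suite YAML Reference for the real keys.

```yaml
# Hypothetical suite config. Field names are illustrative,
# not the actual letta-evals schema.
name: customer-support-regression
dataset: datasets/support_cases.jsonl   # test cases to run
target:
  agent: customer-support-agent         # the agent under test
grader:
  type: tool                            # deterministic Python grading
  function: graders/check_answer.py
extractor: last_assistant_message       # which part of the response to score
gate:
  pass_rate: 0.9                        # fail the suite below 90% passing
```

One suite ties a dataset of test cases to a target agent, a grader, an extractor, and a gate, so a single run yields a pass/fail verdict you can act on.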
Grading & Extraction
Choose how to score your agents (a tool-grader sketch follows this list):
- Tool Graders - Fast, deterministic grading with Python functions
- Rubric Graders - Flexible LLM-as-judge evaluation
- Built-in Extractors - Pre-built content extractors
- Multi-Metric Grading - Evaluate multiple dimensions
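As an illustration of the tool-grader style, the sketch below scores an extracted agent reply with a plain Python function. The function name, signature, and return convention are assumptions made for this example; the Tool Graders and Custom Graders pages define the actual interface.

```python
# Hypothetical tool grader. The signature and return convention are
# assumptions for illustration, not the actual letta-evals interface.

def grade(extracted_output: str, expected: str) -> dict:
    """Deterministically score an extracted agent reply.

    Returns a score in [0, 1] plus a pass/fail flag, the kind of
    structured result a gate can aggregate over.
    """
    # Exact-match check, normalized for case and surrounding whitespace.
    passed = extracted_output.strip().lower() == expected.strip().lower()
    return {"score": 1.0 if passed else 0.0, "passed": passed}
```

Because the logic is plain Python, results are fast and reproducible, which is what makes tool graders a good fit for pass/fail gates; rubric graders trade that determinism for the flexibility of an LLM judge.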
Advanced
- Custom Graders - Write your own grading logic
- Custom Extractors - Build custom extractors
- Multi-Turn Conversations - Test memory and state (see the dataset sketch after this list)
- Suite YAML Reference - Complete configuration schema
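A multi-turn test case can be pictured as a scripted conversation with a final expectation. The fields below are hypothetical, chosen only to show the shape of a memory test; the Datasets and Multi-Turn Conversations pages document the real format.

```yaml
# Hypothetical multi-turn test case. Field names are illustrative.
- id: remembers-user-name
  turns:
    - "Hi, my name is Priya."
    - "I'm looking for hiking boots."
    - "What was my name again?"        # probes memory of turn 1
  expected: "Priya"                    # graded against the final reply
```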
Reference
- CLI Commands - Command-line interface
- Understanding Results - Interpret metrics
- Troubleshooting - Common issues and solutions
Resources
- GitHub Repository - Source code, issues, and contributions
- PyPI Package - Install with `pip install letta-evals`
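Once installed, a typical workflow is to point the CLI at a suite file. The invocation below is a guess at the shape of that command; the actual subcommands and flags are listed under CLI Commands.

```bash
# Hypothetical invocation. Check the CLI Commands reference for
# the actual subcommand and flag names.
letta-evals run path/to/suite.yaml
```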