Letta Evals
Systematic testing for stateful AI agents. Validate changes, prevent regressions, and ship with confidence.
Test agent memory, tool usage, multi-turn conversations, and state evolution with automated grading and pass/fail gates.
Ready to start? Jump to Getting Started or learn the Core Concepts first.
Core Concepts
Understand the building blocks of an evaluation; a minimal suite sketch follows this list:
- Suites - Configure your evaluation
- Datasets - Define test cases
- Targets - Specify the agent to test
- Graders - Score agent outputs
- Extractors - Extract content from responses
- Gates - Set pass/fail criteria
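To see how these pieces fit together, here is a minimal, hypothetical suite configuration. The field names and values shown are illustrative assumptions, not the actual schema; see the Suite YAML Reference for the real keys.

```yaml
# Hypothetical suite config. Field names are illustrative,
# not the actual letta-evals schema.
name: customer-support-regression
dataset: datasets/support_cases.jsonl   # test cases to run
target:
  agent: customer-support-agent         # the agent under test
grader:
  type: tool                            # deterministic Python grading
  function: graders/check_answer.py
extractor: last_assistant_message       # which part of the response to score
gate:
  pass_rate: 0.9                        # fail the suite below 90% passing
```

One suite ties a dataset of test cases to a target agent, a grader, an extractor, and a gate, so a single run yields a pass/fail verdict you can act on.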
Grading & Extraction
Choose how to score your agents (a tool-grader sketch follows this list):
- Tool Graders - Fast, deterministic grading with Python functions
- Rubric Graders - Flexible LLM-as-judge evaluation
- Built-in Extractors - Pre-built content extractors
- Multi-Metric Grading - Evaluate multiple dimensions
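As an illustration of the tool-grader style, the sketch below scores an extracted agent reply with a plain Python function. The function name, signature, and return convention are assumptions made for this example; the Tool Graders and Custom Graders pages define the actual interface.

```python
# Hypothetical tool grader. The signature and return convention are
# assumptions for illustration, not the actual letta-evals interface.

def grade(extracted_output: str, expected: str) -> dict:
    """Deterministically score an extracted agent reply.

    Returns a score in [0, 1] plus a pass/fail flag, the kind of
    structured result a gate can aggregate over.
    """
    # Exact-match check, normalized for case and surrounding whitespace.
    passed = extracted_output.strip().lower() == expected.strip().lower()
    return {"score": 1.0 if passed else 0.0, "passed": passed}
```

Because the logic is plain Python, results are fast and reproducible, which is what makes tool graders a good fit for pass/fail gates; rubric graders trade that determinism for the flexibility of an LLM judge.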
Advanced
- Custom Graders - Write your own grading logic
- Custom Extractors - Build custom extractors
- Multi-Turn Conversations - Test memory and state (see the dataset sketch after this list)
- Suite YAML Reference - Complete configuration schema
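A multi-turn test case can be pictured as a scripted conversation with a final expectation. The fields below are hypothetical, chosen only to show the shape of a memory test; the Datasets and Multi-Turn Conversations pages document the real format.

```yaml
# Hypothetical multi-turn test case. Field names are illustrative.
- id: remembers-user-name
  turns:
    - "Hi, my name is Priya."
    - "I'm looking for hiking boots."
    - "What was my name again?"        # probes memory of turn 1
  expected: "Priya"                    # graded against the final reply
```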
Reference
- CLI Commands - Command-line interface
- Understanding Results - Interpret metrics
- Troubleshooting - Common issues and solutions
Resources
- GitHub Repository - Source code, issues, and contributions
- PyPI Package - Install with `pip install letta-evals`
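Once installed, a typical workflow is to point the CLI at a suite file. The invocation below is a guess at the shape of that command; the actual subcommands and flags are listed under CLI Commands.

```bash
# Hypothetical invocation. Check the CLI Commands reference for
# the actual subcommand and flag names.
letta-evals run path/to/suite.yaml
```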