Letta Evals
Introduction to Letta's evaluation framework for testing and measuring agent performance.
Systematic testing for stateful AI agents. Validate changes, prevent regressions, and ship with confidence.
Test agent memory, tool usage, multi-turn conversations, and state evolution with automated grading and pass/fail gates.
Core Concepts
Understand the building blocks of evaluations (a sketch of how they fit together follows this list):
- Suites - Configure your evaluation
- Datasets - Define test cases
- Targets - Specify the agent to test
- Graders - Score agent outputs
- Extractors - Extract content from responses
- Gates - Set pass/fail criteria
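To see how these pieces relate, here is a purely illustrative sketch; the names (run_suite, target.send, and the case fields) are hypothetical and are not the letta-evals API, which is documented in the pages linked above.

```python
# Hypothetical sketch only -- these names are NOT the real letta-evals API.
# A suite points at a dataset of test cases, sends each case to a target agent,
# extracts the relevant content from the response, scores it with a grader,
# and applies a gate to decide overall pass/fail.
def run_suite(dataset, target, extractor, grader, gate_threshold=0.8):
    scores = []
    for case in dataset:                                      # each case: input + expected behavior
        response = target.send(case["input"])                 # run the agent under test
        output = extractor(response)                          # e.g. pull the final assistant message
        scores.append(grader(output, case.get("expected")))   # score in the range 0.0-1.0
    pass_rate = sum(scores) / len(scores)
    return {"pass_rate": pass_rate, "passed": pass_rate >= gate_threshold}  # gate check
```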
Grading & Extraction
Choose how to score your agents:
- Tool Graders - Fast, deterministic grading with Python functions (see the sketch after this list)
- Rubric Graders - Flexible LLM-as-judge evaluation
- Built-in Extractors - Pre-built content extractors
- Multi-Metric Grading - Evaluate multiple dimensions
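For a flavor of the deterministic style that tool graders use, a plain Python function like the one below can score an extracted output; the exact signature letta-evals expects is covered on the Tool Graders page, so treat the parameter names here as assumptions.

```python
# Illustrative deterministic grader (assumed signature; see the Tool Graders page
# for the exact interface letta-evals expects).
def grade_city_answer(output: str, expected: str) -> float:
    """Return 1.0 if the expected city name appears in the agent's output, else 0.0."""
    return 1.0 if expected.lower() in output.lower() else 0.0
```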
Advanced
- Custom Graders - Write your own grading logic
- Custom Extractors - Build custom extractors (see the sketch after this list)
- Multi-Turn Conversations - Test memory and state
- Suite YAML Reference - Complete configuration schema
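As an example of what a custom extractor might do, the sketch below pulls the last assistant turn out of a multi-turn transcript; the message structure it assumes (role/content dicts) is illustrative only, and the real extension interface is covered on the Custom Extractors page.

```python
# Illustrative custom extractor (the message structure is an assumption, not the
# letta-evals wire format): return the text of the last assistant turn.
def last_assistant_message(messages: list[dict]) -> str:
    for message in reversed(messages):
        if message.get("role") == "assistant":
            return message.get("content", "")
    return ""
```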
Reference
- CLI Commands - Command-line interface
- Understanding Results - Interpret metrics
- Troubleshooting - Common issues and solutions
Resources
- GitHub Repository - Source code, issues, and contributions
- PyPI Package - Install with pip install letta-evals