Letta Evals
Introduction to Letta's evaluation framework for testing and measuring agent performance.
Systematic testing for stateful AI agents. Validate changes, prevent regressions, and ship with confidence.
Test agent memory, tool usage, multi-turn conversations, and state evolution with automated grading and pass/fail gates.
Core Concepts
Understand the building blocks of evaluations (a sketch of how they fit together follows this list):
- Suites - Configure your evaluation
- Datasets - Define test cases
- Targets - Specify the agent to test
- Graders - Score agent outputs
- Extractors - Extract content from responses
- Gates - Set pass/fail criteria
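To see how these pieces relate, here is a purely illustrative sketch; the names (run_suite, target.send, and the case fields) are hypothetical and are not the letta-evals API, which is documented in the pages linked above.

```python
# Hypothetical sketch only -- these names are NOT the real letta-evals API.
# A suite points at a dataset of test cases, sends each case to a target agent,
# extracts the relevant content from the response, scores it with a grader,
# and applies a gate to decide overall pass/fail.
def run_suite(dataset, target, extractor, grader, gate_threshold=0.8):
    scores = []
    for case in dataset:                                      # each case: input + expected behavior
        response = target.send(case["input"])                 # run the agent under test
        output = extractor(response)                          # e.g. pull the final assistant message
        scores.append(grader(output, case.get("expected")))   # score in the range 0.0-1.0
    pass_rate = sum(scores) / len(scores)
    return {"pass_rate": pass_rate, "passed": pass_rate >= gate_threshold}  # gate check
```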
Grading & Extraction
Choose how to score your agents:
- Tool Graders - Fast, deterministic grading with Python functions (see the sketch after this list)
- Rubric Graders - Flexible LLM-as-judge evaluation
- Built-in Extractors - Pre-built content extractors
- Multi-Metric Grading - Evaluate multiple dimensions
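For a flavor of the deterministic style that tool graders use, a plain Python function like the one below can score an extracted output; the exact signature letta-evals expects is covered on the Tool Graders page, so treat the parameter names here as assumptions.

```python
# Illustrative deterministic grader (assumed signature; see the Tool Graders page
# for the exact interface letta-evals expects).
def grade_city_answer(output: str, expected: str) -> float:
    """Return 1.0 if the expected city name appears in the agent's output, else 0.0."""
    return 1.0 if expected.lower() in output.lower() else 0.0
```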
Advanced
- Custom Graders - Write your own grading logic
- Custom Extractors - Build custom extractors (see the sketch after this list)
- Multi-Turn Conversations - Test memory and state
- Suite YAML Reference - Complete configuration schema
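As an example of what a custom extractor might do, the sketch below pulls the last assistant turn out of a multi-turn transcript; the message structure it assumes (role/content dicts) is illustrative only, and the real extension interface is covered on the Custom Extractors page.

```python
# Illustrative custom extractor (the message structure is an assumption, not the
# letta-evals wire format): return the text of the last assistant turn.
def last_assistant_message(messages: list[dict]) -> str:
    for message in reversed(messages):
        if message.get("role") == "assistant":
            return message.get("content", "")
    return ""
```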
Reference
- CLI Commands - Command-line interface
- Understanding Results - Interpret metrics
- Troubleshooting - Common issues and solutions
Resources
- GitHub Repository - Source code, issues, and contributions
- PyPI Package - Install with pip install letta-evals