Core Concepts
Understanding how Letta Evals works and what makes it different.
Just want to run an eval? Skip to Getting Started for a hands-on quickstart.
Built for Stateful Agents
Letta Evals is a testing framework specifically designed for agents that maintain state. Unlike traditional eval frameworks built for simple input-output models, Letta Evals understands that agents:
- Maintain memory across conversations
- Use tools and external functions
- Evolve their behavior based on interactions
- Have persistent context and state
This means you can test aspects of your agent that other frameworks can’t: memory updates, multi-turn conversations, tool usage patterns, and state evolution over time.
The Evaluation Flow
Every evaluation follows this flow:
Dataset → Target (Agent) → Extractor → Grader → Gate → Result
- Dataset: Your test cases (questions, scenarios, expected outputs)
- Target: The agent being evaluated
- Extractor: Pulls out the relevant information from the agent’s response
- Grader: Scores the extracted information
- Gate: Pass/fail criteria for the overall evaluation
- Result: Metrics, scores, and detailed results
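To make the mapping concrete, here is a minimal sketch of how these pieces fit together in a suite file. The key names below are illustrative assumptions rather than the exact Letta Evals schema.

```yaml
# Illustrative suite sketch; key names are assumptions, not the exact schema.
name: capital-cities
dataset: datasets/capitals.jsonl      # Dataset: one test case per line
target:
  agent_file: agents/geography.af     # Target: the agent being evaluated
extractor: last_assistant_message     # Extractor: which part of the trajectory to grade
grader:
  type: tool                          # Grader: scores the extracted text
  function: exact_match
gate:
  metric: accuracy                    # Gate: overall pass/fail criterion
  op: ">="
  value: 0.9
```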
What You Can Test
With Letta Evals, you can test aspects of agents that traditional frameworks can’t:
- Memory updates: Did the agent correctly remember the user’s name?
- Multi-turn conversations: Can the agent maintain context across multiple exchanges?
- Tool usage: Does the agent call the right tools with the right arguments?
- State evolution: How does the agent’s internal state change over time?
Example: Testing Memory Updates
Dataset:
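A sketch of one possible test case (the field names are assumptions about the JSONL format, not the exact schema):

```json
{"input": "Hi! Just so you know, my favorite fruit is bananas.", "ground_truth": "bananas", "metadata": {"tags": ["memory"]}}
```

Paired with an extractor that reads the agent’s memory rather than its reply (see Extractor below), the grader can check whether “bananas” was actually written to the memory block.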
This doesn’t just check whether the agent responded correctly; it verifies that the agent actually stored “bananas” in its memory block. Traditional eval frameworks can’t inspect agent state like this.
Why Evals Matter
AI agents are complex systems that can behave unpredictably. Without systematic evaluation, you can’t:
- Know if changes improve or break your agent - Did that prompt tweak help or hurt?
- Prevent regressions - Catch when “fixes” break existing functionality
- Compare approaches objectively - Which model works better for your use case?
- Build confidence before deployment - Ensure quality before shipping to users
- Track improvement over time - Measure progress as you iterate
Manual testing doesn’t scale. Evals let you test hundreds of scenarios in minutes.
What Evals Are Useful For
1. Development & Iteration
- Test prompt changes instantly across your entire test suite
- Experiment with different models and compare results
- Validate that new features work as expected
2. Quality Assurance
- Prevent regressions when modifying agent behavior
- Ensure agents handle edge cases correctly
- Verify tool usage and memory updates
3. Model Selection
- Compare GPT-4 vs Claude vs other models on your specific use case
- Test different model configurations (temperature, system prompts, etc.)
- Find the right cost/performance tradeoff
4. Benchmarking
- Measure agent performance on standard tasks
- Track improvements over time
- Share reproducible results with your team
5. Production Readiness
- Validate agents meet quality thresholds before deployment
- Run continuous evaluation in CI/CD pipelines
- Monitor production agent quality
How Letta Evals Works
Letta Evals is built around a few key concepts that work together to create a flexible evaluation framework.
Key Components
Suite
An evaluation suite is a complete test configuration defined in a YAML file. It ties together:
- Which dataset to use
- Which agent to test
- How to grade responses
- What criteria determine pass/fail
Think of a suite as a reusable test specification.
Dataset
A dataset is a JSONL file where each line represents one test case. Each sample has:
- An input (what to ask the agent)
- Optional ground truth (the expected answer)
- Optional metadata (tags, custom fields)
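For example, one line of a dataset might look like this (field names are a sketch of the format, not necessarily the exact keys):

```json
{"input": "What is the capital of France?", "ground_truth": "Paris", "metadata": {"tags": ["geography"], "difficulty": "easy"}}
```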
Target
The target is what you’re evaluating. Currently, this is a Letta agent, specified by one of the following:
- An agent file (.af)
- An existing agent ID
- A Python script that creates agents programmatically
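Sketched as suite configuration, the three options might look like this (key names are assumptions, not the exact schema):

```yaml
# Choose one of the three ways to specify the target; key names are assumptions.
target:
  agent_file: agents/support_bot.af   # an agent file (.af)

# target:
#   agent_id: <existing-agent-id>     # an existing agent ID

# target:
#   script: scripts/make_agent.py     # a Python script that creates agents
```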
Trajectory
A trajectory is the complete conversation history from one test case. It’s a list of turns, where each turn contains a list of Letta messages (assistant messages, tool calls, tool returns, etc.).
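Conceptually, a trajectory is shaped roughly like this (a simplified sketch; real Letta message objects carry more fields):

```json
[
  [
    {"message_type": "tool_call_message", "tool_call": {"name": "web_search", "arguments": {"query": "weather in Berlin"}}},
    {"message_type": "tool_return_message", "tool_return": "12°C, cloudy"},
    {"message_type": "assistant_message", "content": "It is about 12°C and cloudy in Berlin right now."}
  ]
]
```

The outer list holds the turns; each inner list holds the messages produced during that turn.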
Extractor
An extractor determines what part of the trajectory to evaluate. For example:
- The last thing the agent said
- All tool calls made
- Content from agent memory
- Text matching a pattern
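In the suite file, choosing an extractor might look like one of these (the extractor names and options are illustrative assumptions):

```yaml
# Pick one extractor per suite; names below are assumptions, not exact identifiers.
extractor: last_assistant_message   # the last thing the agent said
# extractor: tool_calls             # all tool calls made
# extractor:
#   type: memory_block              # content from agent memory
#   block: human
# extractor:
#   type: regex                     # text matching a pattern
#   pattern: "ANSWER:\\s*(.+)"
```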
Grader
A grader scores how well the agent performed. There are two types:
- Tool graders: Python functions that compare submission to ground truth
- Rubric graders: LLM judges that evaluate based on custom criteria
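A tool grader can be as simple as a function that returns a normalized score. The exact signature Letta Evals expects may differ; this is a sketch of the idea:

```python
def exact_match(submission: str, ground_truth: str) -> float:
    """Sketch of a tool grader: compare the extracted submission to the
    expected answer and return a score in [0.0, 1.0].
    The actual signature expected by Letta Evals may differ."""
    return 1.0 if submission.strip().lower() == ground_truth.strip().lower() else 0.0
```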
Gate
A gate is the pass/fail threshold for your evaluation. It compares aggregate metrics (like average score or pass rate) against a target value.
Multi-Metric Evaluation
You can define multiple graders in one suite to evaluate different aspects:
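For example, a suite might grade both factual correctness and tone (a sketch; key names and grader options are assumptions):

```yaml
# Two graders in one suite; key names are assumptions, not the exact schema.
graders:
  correctness:
    type: tool            # Python function compared against ground truth
    function: exact_match
  helpfulness:
    type: rubric          # LLM judge scoring against custom criteria
    rubric: "Is the response helpful and polite?"
```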
The gate can check any of these metrics:
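For example, gating on the average correctness score (again a sketch; how metrics are referenced may differ):

```yaml
gate:
  metric: correctness/avg_score   # assumption: metrics addressed as <grader>/<aggregate>
  op: ">="
  value: 0.8
```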
Score Normalization
All scores are normalized to the range [0.0, 1.0]:
- 0.0 = complete failure
- 1.0 = perfect success
- Values in between = partial credit
This allows different grader types to be compared and combined; for example, a binary exact-match grader returns 0.0 or 1.0, while a rubric grader might award 0.7 for a partially correct answer.
Aggregate Metrics
Individual sample scores are aggregated in two ways:
- Average Score: Mean of all scores (0.0 to 1.0)
- Accuracy/Pass Rate: Percentage of samples passing a threshold
You can gate on either metric type.
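For example, if four samples score 1.0, 0.8, 0.5, and 0.0, the average score is (1.0 + 0.8 + 0.5 + 0.0) / 4 = 0.575, while with a pass threshold of 0.7 only two samples pass, giving an accuracy of 50%.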
Next Steps
Dive deeper into each concept: