Getting Started
Run your first Letta agent evaluation in 5 minutes.
Prerequisites
- Python 3.11 or higher
- A running Letta server (local or Letta Cloud)
- A Letta agent to test, either in agent file format or by ID (see Targets for more details)
Installation
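Install the evals package with pip (the package name letta-evals below is an assumption; use the name published on PyPI for this project):

```bash
pip install letta-evals
```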
Or with uv:
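```bash
# Same package-name assumption as above
uv pip install letta-evals
```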
Getting an Agent to Test
Export an existing agent to a file using the Letta SDK:
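A sketch using the Python SDK (the export method name and return type shown here are assumptions; check the Letta SDK reference for the exact call in your version):

```python
import json

from letta_client import Letta

client = Letta(base_url="http://localhost:8283")

# Export the agent in agent file (.af) format.
# NOTE: the method name below is an assumption; see the SDK reference.
exported = client.agents.export_agent_serialized(agent_id="agent-xxxxxxxx")

with open("my_agent.af", "w") as f:
    f.write(exported if isinstance(exported, str) else json.dumps(exported))
```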
Or export via the Agent Development Environment (ADE) by selecting “Export Agent”.
Then reference it in your suite:
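```yaml
# Key names are illustrative; see Targets for the exact schema
target:
  type: agent_file
  agent_file: ./my_agent.af
```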
Other options: You can also use existing agents by ID or programmatically generate agents. See Targets for all agent configuration options.
Quick Start
Let’s create your first evaluation in 3 steps:
1. Create a Test Dataset
Create a file named dataset.jsonl:
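For example, with a few illustrative samples:

```jsonl
{"input": "What is the capital of France?", "ground_truth": "Paris"}
{"input": "Who wrote Pride and Prejudice?", "ground_truth": "Jane Austen"}
{"input": "What is 2 + 2?", "ground_truth": "4"}
```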
Each line is a JSON object with:
- input: The prompt to send to your agent
- ground_truth: The expected answer (used for grading)

ground_truth is optional for some graders (like rubric graders), but required for tool graders like contains and exact_match.
Read more about Datasets for details on how to create your dataset.
2. Create a Suite Configuration
Create a file named suite.yaml:
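For example (a minimal sketch; the key names here are illustrative, so check Suites for the exact schema):

```yaml
name: quickstart-eval
dataset: dataset.jsonl
target:
  type: agent_file
  agent_file: ./my_agent.af
extractor: last_assistant_message   # use the agent's final reply
grader:
  type: contains                    # pass if the reply contains ground_truth
gate:
  metric: pass_rate
  threshold: 0.8
```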
The suite configuration defines:
- the target agent to evaluate
- the dataset of inputs and expected answers
- the extractor and grader used to score each response
- the gate criteria that decide whether the run passes overall
Read more about Suites for details on how to configure your evaluation.
3. Run the Evaluation
Run your evaluation with the following command:
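```bash
# The CLI name and subcommand are assumptions; check the package docs
letta-evals run suite.yaml
```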
You’ll see real-time progress as your evaluation runs, followed by a summary of the results.
Read more about CLI Commands for details about the available commands and options.
Understanding the Results
The core evaluation flow is:
Dataset → Target (Agent) → Extractor → Grader → Gate → Result
The evaluation runner:
- Loads your dataset
- Sends each input to your agent (Target)
- Extracts the relevant information (using the Extractor)
- Grades the response (using the Grader function)
- Computes aggregate metrics
- Checks if metrics pass the Gate criteria
The output shows:
- Average score: Mean score across all samples
- Pass rate: Percentage of samples that passed
- Gate status: Whether the evaluation passed or failed overall
Next Steps
Now that you’ve run your first evaluation, explore more advanced features:
- Core Concepts - Understand suites, datasets, graders, and extractors
- Grader Types - Learn about tool graders vs rubric graders
- Multi-Metric Evaluation - Test multiple aspects simultaneously
- Custom Graders - Write custom grading functions
- Multi-Turn Conversations - Test conversational memory
Common Use Cases
Strict Answer Checking
Use exact matching for cases where the answer must be precisely correct:
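```yaml
# Key names are illustrative; see Grader Types for the exact schema
grader:
  type: exact_match   # the extracted response must equal ground_truth exactly
```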
Subjective Quality Evaluation
Use an LLM judge to evaluate subjective qualities like helpfulness or tone:
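```yaml
# Key names (and the judge model) are illustrative; see Grader Types for the exact schema
grader:
  type: rubric
  rubric_file: rubric.txt
  model: gpt-4o
```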
Then create rubric.txt:
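For example (an illustrative rubric; write whatever criteria your judge should apply):

```text
Score the response from 0 to 1:
- 1.0: Directly answers the question in a clear, friendly tone
- 0.5: Partially answers, or the tone is flat
- 0.0: Off-topic, incorrect, or unhelpful
```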
Testing Tool Calls
Verify that your agent calls specific tools with expected arguments:
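For example (the extractor and grader names here are illustrative; see Extractors and Grader Types for the exact options):

```yaml
extractor: tool_calls        # pull the tool calls the agent made
grader:
  type: contains             # pass if the expected tool name or arguments appear
```

with a dataset sample along the lines of {"input": "Remember that my birthday is in June", "ground_truth": "core_memory_append"}.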
Testing Memory Persistence
Check if the agent correctly updates its memory blocks:
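For example (names are illustrative; see Extractors for the exact memory-block extractor):

```yaml
extractor:
  type: memory_block     # read a memory block after the run
  block: human
grader:
  type: contains         # pass if the block now contains ground_truth
```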
Troubleshooting
“Agent file not found”
Make sure your agent_file path is correct. Paths are relative to the suite YAML file location. Use absolute paths if needed:
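```yaml
target:
  agent_file: /home/me/agents/my_agent.af
```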
“Connection refused”
Your Letta server isn’t running or isn’t accessible. Start it with:
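```bash
letta server
```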
By default, it runs at http://localhost:8283.
“No ground_truth provided”
Tool graders like exact_match and contains require ground_truth in your dataset. Either:
- Add ground_truth to your samples, or
- Use a rubric grader, which doesn’t require ground truth
Agent didn’t respond as expected
Try testing your agent manually first using the Letta SDK or Agent Development Environment (ADE) to see how it behaves before running evaluations. See the Letta documentation for more information.
For more help, see the Troubleshooting Guide.