Getting Started

Run your first Letta agent evaluation in 5 minutes.

Prerequisites

  • Python 3.11 or higher
  • A running Letta server (local or Letta Cloud)
  • A Letta agent to test, either in agent file format or by ID (see Targets for more details)

Installation

$ pip install letta-evals

Or with uv:

$ uv pip install letta-evals

Getting an Agent to Test

Export an existing agent to a file using the Letta SDK:

from letta_client import Letta
import os

client = Letta(
    base_url="http://localhost:8283",  # or https://api.letta.com for Letta Cloud
    token=os.getenv("LETTA_API_KEY")   # required for Letta Cloud
)

# Export an agent to a file
agent_file = client.agents.export_file(agent_id="agent-123")

# Save to disk
with open("my_agent.af", "w") as f:
    f.write(agent_file)

Or export via the Agent Development Environment (ADE) by selecting “Export Agent”.

Then reference it in your suite:

target:
  kind: agent
  agent_file: my_agent.af

Other options: You can also use existing agents by ID or programmatically generate agents. See Targets for all agent configuration options.
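
If you want to double-check the export before wiring it into a suite, the .af file is a JSON snapshot of the agent, so you can inspect it directly. This is an optional sanity check, not part of the evaluation workflow:

import json

# Load the exported agent file and list its top-level fields
# (assumes the export is a JSON object, which is what the .af format contains)
with open("my_agent.af") as f:
    agent_data = json.load(f)

print(sorted(agent_data.keys()))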

Quick Start

Let’s create your first evaluation in 3 steps:

1. Create a Test Dataset

Create a file named dataset.jsonl:

1{"input": "What's the capital of France?", "ground_truth": "Paris"}
2{"input": "Calculate 2+2", "ground_truth": "4"}
3{"input": "What color is the sky?", "ground_truth": "blue"}

Each line is a JSON object with:

  • input: The prompt to send to your agent
  • ground_truth: The expected answer (used for grading)

ground_truth is optional for some graders (like rubric graders), but required for tool graders like contains and exact_match.

Read more about Datasets for details on how to create your dataset.
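
If your test cases live in code, you can also generate dataset.jsonl programmatically. This is a minimal sketch using only the Python standard library; the sample questions are the same ones shown above:

import json

# Each dict becomes one line of dataset.jsonl
samples = [
    {"input": "What's the capital of France?", "ground_truth": "Paris"},
    {"input": "Calculate 2+2", "ground_truth": "4"},
    {"input": "What color is the sky?", "ground_truth": "blue"},
]

with open("dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")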

2. Create a Suite Configuration

Create a file named suite.yaml:

name: my-first-eval
dataset: dataset.jsonl

target:
  kind: agent
  agent_file: my_agent.af  # Path to your agent file
  base_url: http://localhost:8283  # Your Letta server

graders:
  quality:
    kind: tool
    function: contains  # Check if response contains the ground truth
    extractor: last_assistant  # Use the last assistant message

gate:
  metric_key: quality
  op: gte
  value: 0.75  # Require 75% pass rate

The suite configuration defines the dataset to evaluate, the target agent to run against, how each response is graded, and the gate criteria that decide overall pass/fail.

Read more about Suites for details on how to configure your evaluation.
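
Before running, you can sanity-check the YAML yourself. This optional sketch (assuming PyYAML is installed) just confirms the top-level sections shown above are present:

import yaml

# Load the suite file and confirm the expected top-level sections exist
with open("suite.yaml") as f:
    suite = yaml.safe_load(f)

for key in ("name", "dataset", "target", "graders", "gate"):
    assert key in suite, f"suite.yaml is missing the '{key}' section"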

3. Run the Evaluation

Run your evaluation with the following command:

$ letta-evals run suite.yaml

You’ll see real-time progress as your evaluation runs:

Running evaluation: my-first-eval
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 100%
✓ PASSED (2.25/3.00 avg, 75.0% pass rate)

Read more about CLI Commands for details about the available commands and options.
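
If you want to call the evaluation from a script or CI job, one approach is to shell out to the CLI and use the process exit status. This sketch assumes the CLI exits non-zero when the gate fails; check the CLI Commands documentation to confirm that behavior:

import subprocess
import sys

# Run the evaluation suite via the CLI
result = subprocess.run(["letta-evals", "run", "suite.yaml"])

# Assumption: a non-zero exit code means the gate did not pass
if result.returncode != 0:
    print("Evaluation gate failed")
    sys.exit(result.returncode)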

Understanding the Results

The core evaluation flow is:

Dataset → Target (Agent) → Extractor → Grader → Gate → Result

The evaluation runner:

  1. Loads your dataset
  2. Sends each input to your agent (Target)
  3. Extracts the relevant information (using the Extractor)
  4. Grades the response (using the Grader function)
  5. Computes aggregate metrics
  6. Checks if metrics pass the Gate criteria
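
Conceptually, the runner behaves like the following loop. This is a sketch in plain Python with hypothetical helper names, not the actual letta-evals internals:

# Conceptual sketch of the evaluation loop; helper names are hypothetical
def run_suite(dataset, target, extractor, grader, gate):
    scores = []
    for sample in dataset:                                   # 1. load samples
        response = target.send(sample["input"])              # 2. send input to the agent
        output = extractor(response)                         # 3. extract the relevant text
        score = grader(output, sample.get("ground_truth"))   # 4. grade the response
        scores.append(score)

    avg_score = sum(scores) / len(scores)                    # 5. aggregate metrics
    passed = avg_score >= gate["value"]                      # 6. apply the gate (e.g. gte 0.75)
    return avg_score, passed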

The output shows:

  • Average score: Mean score across all samples
  • Pass rate: Percentage of samples that passed
  • Gate status: Whether the evaluation passed or failed overall

Next Steps

Now that you’ve run your first evaluation, explore the common use cases below and the more advanced features covered in the rest of the documentation.

Common Use Cases

Strict Answer Checking

Use exact matching for cases where the answer must be precisely correct:

graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant

Subjective Quality Evaluation

Use an LLM judge to evaluate subjective qualities like helpfulness or tone:

graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt
    model: gpt-4o-mini
    extractor: last_assistant

Then create rubric.txt:

Rate the helpfulness and accuracy of the response.
- Score 1.0 if helpful and accurate
- Score 0.5 if partially helpful
- Score 0.0 if unhelpful or wrong

Testing Tool Calls

Verify that your agent calls specific tools with expected arguments:

graders:
  tool_check:
    kind: tool
    function: contains
    extractor: tool_arguments
    extractor_config:
      tool_name: search
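
With this setup, the contains grader matches ground_truth against the extracted arguments of the search tool call, so a dataset row might look like this (hypothetical sample):

{"input": "Look up the weather in Paris", "ground_truth": "Paris"}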

Testing Memory Persistence

Check if the agent correctly updates its memory blocks:

graders:
  memory_check:
    kind: tool
    function: contains
    extractor: memory_block
    extractor_config:
      block_label: human
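
Here the grader checks that ground_truth appears in the contents of the human memory block after the conversation, so a dataset row might look like this (hypothetical sample):

{"input": "My name is Alice and I live in Berlin.", "ground_truth": "Alice"}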

Troubleshooting

“Agent file not found”

Make sure your agent_file path is correct. Paths are relative to the suite YAML file location. Use absolute paths if needed:

target:
  agent_file: /absolute/path/to/my_agent.af

“Connection refused”

Your Letta server isn’t running or isn’t accessible. Start it with:

$ letta server

By default, it runs at http://localhost:8283.

“No ground_truth provided”

Tool graders like exact_match and contains require ground_truth in your dataset. Either:

  • Add ground_truth to your samples, or
  • Use a rubric grader, which doesn’t require ground_truth
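
With a rubric grader, a sample without ground_truth is valid, for example (hypothetical sample):

{"input": "Write a short, friendly greeting for a new user"}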

Agent didn’t respond as expected

Try testing your agent manually first using the Letta SDK or Agent Development Environment (ADE) to see how it behaves before running evaluations. See the Letta documentation for more information.

For more help, see the Troubleshooting Guide.