Getting Started with Letta Evals

This guide will help you get up and running with Letta Evals in minutes.

Letta Evals is a framework for testing Letta AI agents. It allows you to:

  • Test agent responses against expected outputs
  • Evaluate subjective quality using LLM judges
  • Test tool usage and memory updates
  • Track metrics across multiple evaluation runs
  • Gate deployments on quality thresholds

Unlike most evaluation frameworks designed for simple input-output models, Letta Evals is built for stateful agents that maintain memory, use tools, and evolve over time.

Before you begin, make sure you have:

  • Python 3.11 or higher
  • A running Letta server (local or Letta Cloud)
  • A Letta agent to test, either in agent file format or by ID (see Targets for more details)

Install Letta Evals with pip:

pip install letta-evals

Or with uv:

uv pip install letta-evals

Before you can run evaluations, you need a Letta agent. You have two options:

Option 1: Agent file

Export an existing agent to a file using the Letta SDK:

from letta_client import Letta
import os

client = Letta(
    base_url="http://localhost:8283",  # or https://api.letta.com for Letta Cloud
    token=os.getenv("LETTA_API_KEY"),  # required for Letta Cloud
)

# Export an agent to a file
agent_file = client.agents.export_file(agent_id="agent-123")

# Save to disk
with open("my_agent.af", "w") as f:
    f.write(agent_file)

Or export via the Agent Development Environment (ADE) by selecting “Export Agent”.

This creates an .af file which you can reference in your suite configuration:

target:
  kind: agent
  agent_file: my_agent.af

How it works: When using an agent file, a fresh agent instance is created for each sample in your dataset. Each test runs independently with a clean slate, making this ideal for parallel testing across different inputs.

Example: If your dataset has 5 samples, 5 separate agents will be created and can run in parallel. Each agent starts fresh with no memory of the other tests.

Option 2: Agent ID

If you already have a running agent, use its ID directly:

from letta_client import Letta
import os

client = Letta(
    base_url="http://localhost:8283",  # or https://api.letta.com for Letta Cloud
    token=os.getenv("LETTA_API_KEY"),  # required for Letta Cloud
)

# List all agents
agents = client.agents.list()
for agent in agents:
    print(f"Agent: {agent.name}, ID: {agent.id}")

Then reference it in your suite:

target:
  kind: agent
  agent_id: agent-abc-123

How it works: The same agent instance is used for all samples, processing them sequentially. The agent’s state (memory, message history) carries over between samples, making the dataset behave more like a conversation script than independent test cases.

Example: If your dataset has 5 samples, they all run against the same agent one after another. The agent “remembers” each previous interaction, so sample 3 can reference information from samples 1 and 2.
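For instance, a conversation-script dataset for an agent ID target might look like this (hypothetical samples; with a contains grader, the second sample only passes if the agent remembered the first):

{"input": "Hi! My favorite color is teal.", "ground_truth": "teal"}
{"input": "What's my favorite color?", "ground_truth": "teal"}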

Agent File (.af) - Use when testing independent scenarios

Best for testing how the agent responds to independent, isolated inputs. Each sample gets a fresh agent with no prior context. Tests can run in parallel.

Typical scenarios:

  • “How does the agent answer different questions?”
  • “Does the agent correctly use tools for various tasks?”
  • “Testing behavior across different prompts”

Agent ID - Use when testing conversational flows

Best for testing conversational flows or scenarios where context should build up over time. The agent’s state accumulates as it processes each sample sequentially.

Typical scenarios:

  • “Does the agent remember information across a conversation?”
  • “How does the agent’s memory evolve over multiple exchanges?”
  • “Simulating a realistic user session with multiple requests”

Recommendation: For most evaluation scenarios, use agent files to ensure consistent, reproducible test conditions. Only use agent IDs when you specifically want to test stateful, sequential interactions.

For more details on agent lifecycle and testing behaviors, see the Targets guide.

Let’s create your first evaluation in 3 steps:

Step 1: Create a file named dataset.jsonl:

{"input": "What's the capital of France?", "ground_truth": "Paris"}
{"input": "Calculate 2+2", "ground_truth": "4"}
{"input": "What color is the sky?", "ground_truth": "blue"}

Each line is a JSON object with:

  • input: The prompt to send to your agent
  • ground_truth: The expected answer (used for grading)

Note: ground_truth is optional for some graders (like rubric graders), but required for tool graders like contains and exact_match.

Read more about Datasets for details on how to create your dataset.
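If you prefer to generate the dataset programmatically, plain Python is enough. This sketch uses only the standard library and writes the same three samples shown above:

import json

samples = [
    {"input": "What's the capital of France?", "ground_truth": "Paris"},
    {"input": "Calculate 2+2", "ground_truth": "4"},
    {"input": "What color is the sky?", "ground_truth": "blue"},
]

# Write one JSON object per line (the JSONL format expected by Letta Evals)
with open("dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")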

Step 2: Create a file named suite.yaml:

name: my-first-eval
dataset: dataset.jsonl
target:
  kind: agent
  agent_file: my_agent.af          # Path to your agent file
  base_url: http://localhost:8283  # Your Letta server
graders:
  quality:
    kind: tool
    function: contains        # Check if response contains the ground truth
    extractor: last_assistant # Use the last assistant message
gate:
  metric_key: quality
  op: gte
  value: 0.75 # Require 75% pass rate

The suite configuration defines the dataset to load, the target agent to evaluate, the graders that score each response, and the gate that determines whether the run passes overall.

Read more about Suites for details on how to configure your evaluation.

Step 3: Run your evaluation with the following command:

letta-evals run suite.yaml

You’ll see real-time progress as your evaluation runs:

Running evaluation: my-first-eval
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 100%
✓ PASSED (2.25/3.00 avg, 75.0% pass rate)

Read more about CLI Commands for details about the available commands and options.

The core evaluation flow is:

Dataset → Target (Agent) → Extractor → Grader → Gate → Result

The evaluation runner (see the sketch after this list):

  1. Loads your dataset
  2. Sends each input to your agent (Target)
  3. Extracts the relevant information (using the Extractor)
  4. Grades the response (using the Grader function)
  5. Computes aggregate metrics
  6. Checks if metrics pass the Gate criteria
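Conceptually, those steps boil down to a small loop. The sketch below is an illustration only, not the actual letta-evals runner (which also handles parallelism, per-sample pass/fail, and multiple graders); every function name here is a placeholder:

def run_eval(samples, send_to_agent, extract, grade, gate_value=0.75):
    scores = []
    for sample in samples:                                    # 1. loaded dataset samples
        response = send_to_agent(sample["input"])             # 2. Target: the agent handles the input
        candidate = extract(response)                         # 3. Extractor: pull out the relevant text
        score = grade(candidate, sample.get("ground_truth"))  # 4. Grader: score between 0.0 and 1.0
        scores.append(score)
    avg_score = sum(scores) / len(scores)                     # 5. aggregate metric
    passed = avg_score >= gate_value                          # 6. Gate: op "gte" against the threshold
    return avg_score, passed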

The output shows:

  • Average score: Mean score across all samples
  • Pass rate: Percentage of samples that passed
  • Gate status: Whether the evaluation passed or failed overall

Now that you’ve run your first evaluation, explore more advanced features:

Use exact matching for cases where the answer must be precisely correct:

graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant
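Because exact_match requires the extracted message to equal the ground truth exactly, prompts for this grader usually constrain the output format. A hypothetical sample:

{"input": "What is 2+2? Reply with only the number.", "ground_truth": "4"}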

Use an LLM judge to evaluate subjective qualities like helpfulness or tone:

graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt
    model: gpt-4o-mini
    extractor: last_assistant

Then create rubric.txt:

Rate the helpfulness and accuracy of the response.
- Score 1.0 if helpful and accurate
- Score 0.5 if partially helpful
- Score 0.0 if unhelpful or wrong

Verify that your agent calls specific tools with expected arguments:

graders:
  tool_check:
    kind: tool
    function: contains
    extractor: tool_arguments
    extractor_config:
      tool_name: search
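With this configuration, the grader checks whether the arguments your agent passes to the search tool contain the ground truth. A hypothetical sample:

{"input": "Search for recent news about the Artemis program", "ground_truth": "Artemis"}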

Check if the agent correctly updates its memory blocks:

graders:
  memory_check:
    kind: tool
    function: contains
    extractor: memory_block
    extractor_config:
      block_label: human
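Here the grader checks whether the human memory block contains the ground truth after the agent has processed the input. A hypothetical sample:

{"input": "Please remember that my name is Alice.", "ground_truth": "Alice"}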

“Agent file not found”

Make sure your agent_file path is correct. Paths are relative to the suite YAML file location. Use absolute paths if needed:

target:
  agent_file: /absolute/path/to/my_agent.af

“Connection refused”

Your Letta server isn’t running or isn’t accessible. Start it with:

letta server

By default, it runs at http://localhost:8283.
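If you're not sure whether the server is reachable, a quick script using the same SDK call shown earlier in this guide can confirm it (a minimal sketch):

from letta_client import Letta

client = Letta(base_url="http://localhost:8283")
try:
    # Any simple API call works as a connectivity check
    agents = list(client.agents.list())
    print(f"Server reachable; found {len(agents)} agent(s)")
except Exception as e:
    print(f"Could not reach the Letta server: {e}")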

“No ground_truth provided”

Tool graders like exact_match and contains require ground_truth in your dataset. Either:

  • Add ground_truth to your samples, or
  • Use a rubric grader which doesn’t require ground truth

Agent didn’t respond as expected

Try testing your agent manually first using the Letta SDK or Agent Development Environment (ADE) to see how it behaves before running evaluations. See the Letta documentation for more information.
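For example, you can send a single message with the SDK and inspect the reply before wiring the agent into a suite (a sketch; the exact messages call can differ between SDK versions, so check the Letta SDK reference):

from letta_client import Letta
import os

client = Letta(
    base_url="http://localhost:8283",
    token=os.getenv("LETTA_API_KEY"),
)

response = client.agents.messages.create(
    agent_id="agent-abc-123",  # replace with your agent's ID
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
for message in response.messages:
    print(message)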

For more help, see the Troubleshooting Guide.