Core Concepts

Understanding how Letta Evals works and what makes it different.

Just want to run an eval? Skip to Getting Started for a hands-on quickstart.

Built for Stateful Agents

Letta Evals is a testing framework specifically designed for agents that maintain state. Unlike traditional eval frameworks built for simple input-output models, Letta Evals understands that agents:

  • Maintain memory across conversations
  • Use tools and external functions
  • Evolve their behavior based on interactions
  • Have persistent context and state

This means you can test aspects of your agent that other frameworks can’t: memory updates, multi-turn conversations, tool usage patterns, and state evolution over time.

The Evaluation Flow

Every evaluation follows this flow:

Dataset → Target (Agent) → Extractor → Grader → Gate → Result

  1. Dataset: Your test cases (questions, scenarios, expected outputs)
  2. Target: The agent being evaluated
  3. Extractor: Pulls out the relevant information from the agent’s response
  4. Grader: Scores the extracted information
  5. Gate: Pass/fail criteria for the overall evaluation
  6. Result: Metrics, scores, and detailed results
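
In a suite file, these six stages map onto a handful of YAML sections. The sketch below is a minimal example: the graders and gate sections follow the examples later on this page, while the name, dataset, and target keys are illustrative assumptions rather than the exact schema (see Suites for details).

# Minimal suite sketch; grader and gate fields match the examples on this
# page, the name, dataset, and target keys are assumptions for illustration.
name: capital-cities
dataset: datasets/capitals.jsonl     # one test case per JSONL line
target:
  agent_file: agents/geography.af    # the agent being evaluated
graders:
  accuracy:
    kind: tool                       # deterministic Python-function grader
    function: exact_match            # compare the extracted text to ground_truth
    extractor: last_assistant        # grade the agent's final reply
gate:
  metric_key: accuracy               # which aggregate metric decides pass/fail
  op: gte                            # pass when the metric is >= value
  value: 0.8

Each sample flows through the target agent, the extractor pulls out the final reply, the grader scores it against the ground truth, and the gate decides whether the run as a whole passes.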

What You Can Test

With Letta Evals, you can test aspects of agents that traditional frameworks can’t:

  • Memory updates: Did the agent correctly remember the user’s name?
  • Multi-turn conversations: Can the agent maintain context across multiple exchanges?
  • Tool usage: Does the agent call the right tools with the right arguments?
  • State evolution: How does the agent’s internal state change over time?

Example: Testing Memory Updates

graders:
  memory_check:
    kind: tool                   # Deterministic grading
    function: contains           # Check if ground_truth appears in the extracted content
    extractor: memory_block      # Extract from agent memory (not just the response!)
    extractor_config:
      block_label: human         # Which memory block to check

Dataset:

{"input": "Please remember that I like bananas.", "ground_truth": "bananas"}

This doesn’t just check whether the agent responded correctly; it verifies that the agent actually stored “bananas” in its memory block. Traditional eval frameworks can’t inspect agent state like this.

Why Evals Matter

AI agents are complex systems that can behave unpredictably. Without systematic evaluation, you can’t:

  • Know if changes improve or break your agent - Did that prompt tweak help or hurt?
  • Prevent regressions - Catch when “fixes” break existing functionality
  • Compare approaches objectively - Which model works better for your use case?
  • Build confidence before deployment - Ensure quality before shipping to users
  • Track improvement over time - Measure progress as you iterate

Manual testing doesn’t scale. Evals let you test hundreds of scenarios in minutes.

What Evals Are Useful For

1. Development & Iteration

  • Test prompt changes instantly across your entire test suite
  • Experiment with different models and compare results
  • Validate that new features work as expected

2. Quality Assurance

  • Prevent regressions when modifying agent behavior
  • Ensure agents handle edge cases correctly
  • Verify tool usage and memory updates

3. Model Selection

  • Compare GPT-4 vs Claude vs other models on your specific use case
  • Test different model configurations (temperature, system prompts, etc.)
  • Find the right cost/performance tradeoff

4. Benchmarking

  • Measure agent performance on standard tasks
  • Track improvements over time
  • Share reproducible results with your team

5. Production Readiness

  • Validate agents meet quality thresholds before deployment
  • Run continuous evaluation in CI/CD pipelines
  • Monitor production agent quality

How Letta Evals Works

Letta Evals is built around a few key concepts that work together to create a flexible evaluation framework.

Key Components

Suite

An evaluation suite is a complete test configuration defined in a YAML file. It ties together:

  • Which dataset to use
  • Which agent to test
  • How to grade responses
  • What criteria determine pass/fail

Think of a suite as a reusable test specification.

Dataset

A dataset is a JSONL file where each line represents one test case. Each sample has:

  • An input (what to ask the agent)
  • Optional ground truth (the expected answer)
  • Optional metadata (tags, custom fields)
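
For example, a two-line dataset could look like this. The input and ground_truth keys match the example earlier on this page; the metadata field is an assumption about how optional fields might be attached (see Datasets for the exact format).

{"input": "What is the capital of France?", "ground_truth": "Paris"}
{"input": "What is the capital of Japan?", "ground_truth": "Tokyo", "metadata": {"tags": ["geography"]}}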

Target

The target is what you’re evaluating. Currently, this is a Letta agent, specified by:

  • An agent file (.af)
  • An existing agent ID
  • A Python script that creates agents programmatically
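
In the suite YAML, each option corresponds to a different target configuration. The key names in the sketch below (agent_file, agent_id, agent_script) are assumptions for illustration only; see Targets for the actual schema.

# Three alternative ways a target might be specified; key names are assumptions.
target:
  agent_file: agents/support_bot.af     # option 1: load the agent from an .af file

# or
target:
  agent_id: agent-1234abcd              # option 2: evaluate an existing agent by ID

# or
target:
  agent_script: scripts/make_agent.py   # option 3: create agents programmatically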

Trajectory

A trajectory is the complete conversation history from one test case. It’s a list of turns, where each turn contains a list of Letta messages (assistant messages, tool calls, tool returns, etc.).

Extractor

An extractor determines what part of the trajectory to evaluate. For example:

  • The last thing the agent said
  • All tool calls made
  • Content from agent memory
  • Text matching a pattern
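
Each grader names one extractor, optionally with an extractor_config. The fragments below reuse extractor names that appear in the examples on this page (last_assistant, memory_block, tool_arguments); a pattern-matching extractor also exists but is not shown here.

# Grader fragments showing only the extractor-related keys.
extractor: last_assistant        # the last thing the agent said

extractor: memory_block          # content from agent memory
extractor_config:
  block_label: human             # which memory block to read

extractor: tool_arguments        # arguments from the agent's tool calls
extractor_config:
  tool_name: search              # which tool's calls to extract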

Grader

A grader scores how well the agent performed. There are two types:

  • Tool graders: Python functions that compare submission to ground truth
  • Rubric graders: LLM judges that evaluate based on custom criteria
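
As a rough sketch, one suite can mix both types. The tool grader fields match the examples on this page; the rubric grader fields (kind: rubric, prompt) are assumptions about the syntax, not confirmed schema, so check Graders for the real options.

graders:
  exactness:
    kind: tool                   # deterministic: a Python function scores the submission
    function: exact_match
    extractor: last_assistant
  helpfulness:
    kind: rubric                 # LLM judge; the field names below are assumptions
    extractor: last_assistant
    prompt: |                    # hypothetical judging criteria
      Score 1.0 if the reply fully and politely answers the question,
      0.5 if it partially answers, and 0.0 otherwise.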

Gate

A gate is the pass/fail threshold for your evaluation. It compares aggregate metrics (like average score or pass rate) against a target value.

Multi-Metric Evaluation

You can define multiple graders in one suite to evaluate different aspects:

graders:
  accuracy:                      # Check if the answer is correct
    kind: tool
    function: exact_match
    extractor: last_assistant    # Use the final response

  tool_usage:                    # Check if the agent called the right tool
    kind: tool
    function: contains
    extractor: tool_arguments    # Extract tool call arguments
    extractor_config:
      tool_name: search          # From the search tool

The gate can check any of these metrics:

gate:
  metric_key: accuracy           # Gate on accuracy (tool_usage is still computed)
  op: gte                        # >=
  value: 0.8                     # 80% threshold

Score Normalization

All scores are normalized to the range [0.0, 1.0]:

  • 0.0 = complete failure
  • 1.0 = perfect success
  • Values in between = partial credit

This allows different grader types to be compared and combined.

Aggregate Metrics

Individual sample scores are aggregated in two ways:

  1. Average Score: Mean of all scores (0.0 to 1.0)
  2. Accuracy/Pass Rate: Percentage of samples passing a threshold

You can gate on either metric type.
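
A gate on the pass rate appears earlier on this page (metric_key: accuracy). Gating on the average score would use a different metric key; the avg_score name below is an assumption included only to illustrate the idea, so check Gates for the actual metric names.

gate:
  metric_key: avg_score          # mean of all sample scores ("avg_score" is an assumed key name)
  op: gte
  value: 0.75                    # require an average score of at least 0.75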

Next Steps

Dive deeper into each concept:

  • Suites - Suite configuration in detail
  • Datasets - Creating effective test datasets
  • Targets - Agent configuration options
  • Graders - Understanding grader types
  • Extractors - Extraction strategies
  • Gates - Setting pass/fail criteria