---
title: Core concepts | Letta Docs
description: Core concepts of the Letta evaluation framework including targets, datasets, graders, and extractors.
---

This page explains how Letta Evals works and what makes it different.

**Just want to run an eval?** Skip to [Getting Started](/guides/evals/getting-started/index.md) for a hands-on quickstart.

## Built for Stateful Agents

Letta Evals is a testing framework specifically designed for agents that maintain state. Unlike traditional eval frameworks built for simple input-output models, Letta Evals understands that agents:

- Maintain memory across conversations
- Use tools and external functions
- Evolve their behavior based on interactions
- Have persistent context and state

This means you can test aspects of your agent that other frameworks can’t: memory updates, multi-turn conversations, tool usage patterns, and state evolution over time.

## The Evaluation Flow

Every evaluation follows this flow:

**Dataset → Target (Agent) → Extractor → Grader → Gate → Result**

1. **Dataset**: Your test cases (questions, scenarios, expected outputs)
2. **Target**: The agent being evaluated
3. **Extractor**: Pulls out the relevant information from the agent’s response
4. **Grader**: Scores the extracted information
5. **Gate**: Pass/fail criteria for the overall evaluation
6. **Result**: Metrics, scores, and detailed results
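
The flow above can be pictured as a small loop. The following is only a conceptual sketch in plain Python, not the Letta Evals API: `run_eval`, `echo_agent`, and the lambda extractor and grader are hypothetical stand-ins for each stage, and the "agent" simply echoes its input.

```python
from statistics import mean
from typing import Callable

# Hypothetical stand-ins for each stage; none of these names come from Letta Evals.
Sample = dict  # {"input": ..., "ground_truth": ...}


def run_eval(
    dataset: list[Sample],
    target: Callable[[str], list[str]],     # input -> trajectory (here: list of message strings)
    extractor: Callable[[list[str]], str],  # trajectory -> submission
    grader: Callable[[str, str], float],    # (submission, ground_truth) -> score in [0, 1]
    threshold: float,                       # gate: minimum average score
) -> dict:
    scores = []
    for sample in dataset:
        trajectory = target(sample["input"])
        submission = extractor(trajectory)
        scores.append(grader(submission, sample["ground_truth"]))
    avg = mean(scores)
    return {"scores": scores, "avg_score": avg, "passed": avg >= threshold}


def echo_agent(text: str) -> list[str]:
    # Toy target that echoes the input back, standing in for a real agent.
    return [f"You said: {text}"]


result = run_eval(
    dataset=[
        {"input": "I like bananas.", "ground_truth": "bananas"},
        {"input": "I like apples.", "ground_truth": "pears"},
    ],
    target=echo_agent,
    extractor=lambda trajectory: trajectory[-1],
    grader=lambda submission, truth: 1.0 if truth in submission else 0.0,
    threshold=0.5,
)
print(result)
```

Each stage is a separate callable, which mirrors why extractors and graders can be swapped independently in a suite.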

### What You Can Test

Because Letta Evals can inspect agent state as well as responses, you can test behaviors such as:

- **Memory updates**: Did the agent correctly remember the user’s name?
- **Multi-turn conversations**: Can the agent maintain context across multiple exchanges?
- **Tool usage**: Does the agent call the right tools with the right arguments?
- **State evolution**: How does the agent’s internal state change over time?

**Example: Testing Memory Updates**

```yaml
graders:
  memory_check:
    kind: tool # Deterministic grading
    function: contains # Check if ground_truth in extracted content
    extractor: memory_block # Extract from agent memory (not just response!)
    extractor_config:
      block_label: human # Which memory block to check
```

Dataset:

```json
{"input": "Please remember that I like bananas.", "ground_truth": "bananas"}
```

This checks more than whether the agent responded correctly: it verifies that the agent actually stored “bananas” in its memory block. Traditional eval frameworks can’t inspect agent state like this.

## Why Evals Matter

AI agents are complex systems that can behave unpredictably. Without systematic evaluation, you can’t:

- **Know if changes improve or break your agent** - Did that prompt tweak help or hurt?
- **Prevent regressions** - Catch when “fixes” break existing functionality
- **Compare approaches objectively** - Which model works better for your use case?
- **Build confidence before deployment** - Ensure quality before shipping to users
- **Track improvement over time** - Measure progress as you iterate

Manual testing doesn’t scale. Evals let you test hundreds of scenarios in minutes.

## What Evals Are Useful For

### 1. Development & Iteration

- Test prompt changes instantly across your entire test suite
- Experiment with different models and compare results
- Validate that new features work as expected

### 2. Quality Assurance

- Prevent regressions when modifying agent behavior
- Ensure agents handle edge cases correctly
- Verify tool usage and memory updates

### 3. Model Selection

- Compare GPT-4 vs Claude vs other models on your specific use case
- Test different model configurations (temperature, system prompts, etc.)
- Find the right cost/performance tradeoff

### 4. Benchmarking

- Measure agent performance on standard tasks
- Track improvements over time
- Share reproducible results with your team

### 5. Production Readiness

- Validate agents meet quality thresholds before deployment
- Run continuous evaluation in CI/CD pipelines
- Monitor production agent quality

## How Letta Evals Works

Letta Evals is built around a few key concepts that work together to create a flexible evaluation framework.

## Key Components

### Suite

An **evaluation suite** is a complete test configuration defined in a YAML file. It ties together:

- Which dataset to use
- Which agent to test
- How to grade responses
- What criteria determine pass/fail

Think of a suite as a reusable test specification.

### Dataset

A **dataset** is a JSONL file where each line represents one test case. Each sample has:

- An input (what to ask the agent)
- Optional ground truth (the expected answer)
- Optional metadata (tags, custom fields)
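
As a minimal sketch of the file format, the snippet below writes and reads a JSONL dataset with the Python standard library. The `input` and `ground_truth` keys match the example earlier on this page; the `metadata` key name and the helper functions `write_jsonl` and `read_jsonl` are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path: Path, rows: list[dict]) -> None:
    # One JSON object per line, no pretty-printing.
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    with path.open(encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not break parsing.
        return [json.loads(line) for line in f if line.strip()]


samples = [
    {"input": "Please remember that I like bananas.", "ground_truth": "bananas"},
    # "metadata" is an assumed key name for the optional tags/custom fields.
    {"input": "What is my favorite fruit?", "ground_truth": "bananas",
     "metadata": {"tags": ["memory"]}},
]

with tempfile.TemporaryDirectory() as tmp:
    dataset_path = Path(tmp) / "dataset.jsonl"
    write_jsonl(dataset_path, samples)
    loaded = read_jsonl(dataset_path)

print(len(loaded), loaded[0]["ground_truth"])
```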

### Target

The **target** is what you’re evaluating. Currently, this is a Letta agent, which you can specify in one of three ways:

- An agent file (.af)
- An existing agent ID
- A Python script that creates agents programmatically

### Trajectory

A **trajectory** is the complete conversation history from one test case. It’s a list of turns, where each turn contains a list of Letta messages (assistant messages, tool calls, tool returns, etc.).
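
To make the nesting concrete, here is a rough model of that shape in plain Python. The `Message` class and its `message_type` labels are hypothetical simplifications, not the actual Letta message types.

```python
from dataclasses import dataclass


@dataclass
class Message:
    message_type: str  # hypothetical labels such as "tool_call" or "assistant"
    content: str


Turn = list[Message]      # everything the agent emitted for one input
Trajectory = list[Turn]   # one entry per turn of the test case

trajectory: Trajectory = [
    [  # turn 1
        Message("tool_call", '{"name": "search", "arguments": {"query": "bananas"}}'),
        Message("tool_return", "Bananas are rich in potassium."),
        Message("assistant", "Bananas are rich in potassium."),
    ],
    [  # turn 2
        Message("assistant", "Anything else you'd like to know?"),
    ],
]

num_turns = len(trajectory)
num_messages = sum(len(turn) for turn in trajectory)
print(num_turns, num_messages)
```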

### Extractor

An **extractor** determines what part of the trajectory to evaluate. For example:

- The last thing the agent said
- All tool calls made
- Content from agent memory
- Text matching a pattern
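
The sketch below shows how three of these strategies might look as plain functions over a trajectory. The names `last_assistant` and `tool_arguments` echo the extractor names used in the config examples on this page, but the implementations, the `pattern` helper, and the message dict layout are assumptions for illustration only. A memory extractor is omitted because it reads agent state rather than the trajectory.

```python
import json
import re

# A trajectory modeled as a list of turns, each a list of message dicts.
# The "type" labels and dict keys are illustrative, not Letta's message schema.
trajectory = [
    [
        {"type": "tool_call", "name": "search", "arguments": {"query": "banana nutrition"}},
        {"type": "tool_return", "content": "Bananas contain about 105 calories."},
        {"type": "assistant", "content": "A banana has about 105 calories."},
    ],
]


def messages(trajectory):
    # Flatten turns into a single ordered list of messages.
    return [m for turn in trajectory for m in turn]


def last_assistant(trajectory) -> str:
    # The last thing the agent said.
    for m in reversed(messages(trajectory)):
        if m["type"] == "assistant":
            return m["content"]
    return ""


def tool_arguments(trajectory, tool_name: str) -> str:
    # Arguments of every call to the named tool, serialized for string graders.
    calls = [
        m["arguments"]
        for m in messages(trajectory)
        if m["type"] == "tool_call" and m["name"] == tool_name
    ]
    return json.dumps(calls)


def pattern(trajectory, regex: str) -> str:
    # First regex match in the final assistant message, or "" if none.
    match = re.search(regex, last_assistant(trajectory))
    return match.group(0) if match else ""


print(last_assistant(trajectory))
print(tool_arguments(trajectory, "search"))
print(pattern(trajectory, r"\d+"))
```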

### Grader

A **grader** scores how well the agent performed. There are two types:

- **Tool graders**: Python functions that compare submission to ground truth
- **Rubric graders**: LLM judges that evaluate based on custom criteria
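
A tool grader can be thought of as a function from a submission and a ground truth to a score between 0.0 and 1.0. The two functions below are a sketch of that idea; they reuse the names `exact_match` and `contains` from the config examples, but the whitespace stripping and case-insensitive matching are assumptions, and the built-in graders may behave differently. Rubric graders call an LLM judge and are not shown.

```python
def exact_match(submission: str, ground_truth: str) -> float:
    # Full credit only when the two strings are identical after trimming
    # surrounding whitespace (the trimming is an assumption of this sketch).
    return 1.0 if submission.strip() == ground_truth.strip() else 0.0


def contains(submission: str, ground_truth: str) -> float:
    # Full credit when the ground truth appears anywhere in the submission.
    # Case-insensitive matching is an assumption of this sketch.
    return 1.0 if ground_truth.lower() in submission.lower() else 0.0


print(exact_match("Paris", "Paris "))
print(contains("The user likes Bananas.", "bananas"))
print(contains("The user likes apples.", "bananas"))
```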

### Gate

A **gate** is the pass/fail threshold for your evaluation. It compares aggregate metrics (like average score or pass rate) against a target value.

## Multi-Metric Evaluation

You can define multiple graders in one suite to evaluate different aspects:

```yaml
graders:
  accuracy: # Check if answer is correct
    kind: tool
    function: exact_match
    extractor: last_assistant # Use final response

  tool_usage: # Check if agent called the right tool
    kind: tool
    function: contains
    extractor: tool_arguments # Extract tool call args
    extractor_config:
      tool_name: search # From search tool
```

The gate can check any of these metrics:

```yaml
gate:
  metric_key: accuracy # Gate on accuracy (tool_usage still computed)
  op: gte # >=
  value: 0.8 # 80% threshold
```
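
One way to read a gate config like this is as a comparison between one aggregate metric and a fixed value. The sketch below assumes a hypothetical `gate_passes` helper and a metrics dict keyed by grader name; only the `gte` operator appears on this page, so the other operator names in `OPS` are assumptions.

```python
import operator

# Only "gte" is shown in the config above; the remaining keys are assumed names.
OPS = {
    "gte": operator.ge,
    "gt": operator.gt,
    "lte": operator.le,
    "lt": operator.lt,
    "eq": operator.eq,
}


def gate_passes(metrics: dict, metric_key: str, op: str, value: float) -> bool:
    # Look up the aggregate for the chosen grader and compare it to the threshold.
    return OPS[op](metrics[metric_key], value)


# Hypothetical aggregate results for the two graders defined earlier.
metrics = {"accuracy": 0.85, "tool_usage": 0.6}

print(gate_passes(metrics, "accuracy", "gte", 0.8))
print(gate_passes(metrics, "tool_usage", "gte", 0.8))
```

Gating on `accuracy` leaves `tool_usage` in the results as an informational metric, which matches the comment in the config above.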

## Score Normalization

All scores are normalized to the range [0.0, 1.0]:

- 0.0 = complete failure
- 1.0 = perfect success
- Values in between = partial credit

This allows different grader types to be compared and combined.
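
How a raw score is mapped onto this range depends on the grader. As a purely illustrative sketch, assuming a hypothetical judge that scores on a 1 to 5 scale, min-max scaling with clamping is one option; the `normalize` function is not part of Letta Evals.

```python
def normalize(raw: float, lo: float, hi: float) -> float:
    """Map a raw score on [lo, hi] onto [0.0, 1.0], clamping out-of-range values."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = (raw - lo) / (hi - lo)
    return max(0.0, min(1.0, scaled))


print(normalize(4, 1, 5))  # a 4 on a 1-5 scale
print(normalize(7, 1, 5))  # out-of-range input is clamped
```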

## Aggregate Metrics

Individual sample scores are aggregated in two ways:

1. **Average Score**: Mean of all scores (0.0 to 1.0)
2. **Accuracy/Pass Rate**: Percentage of samples passing a threshold

You can gate on either metric type.
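
Both aggregates can be computed from the same list of per-sample scores. In this sketch the `aggregate` helper and its default `pass_threshold` of 1.0 are assumptions, and the pass rate is returned as a fraction rather than a percentage.

```python
from statistics import mean


def aggregate(scores: list[float], pass_threshold: float = 1.0) -> dict:
    # Average score: plain mean of the normalized per-sample scores.
    avg_score = mean(scores)
    # Pass rate: fraction of samples whose score meets the threshold.
    pass_rate = sum(s >= pass_threshold for s in scores) / len(scores)
    return {"avg_score": avg_score, "pass_rate": pass_rate}


scores = [1.0, 0.5, 0.0, 1.0]
summary = aggregate(scores)
print(summary)
```

The two numbers can diverge: partial credit raises the average score but does not count toward the pass rate unless it meets the threshold.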

## Next Steps

Dive deeper into each concept:

- [Suites](/guides/evals/concepts/suites/index.md) - Suite configuration in detail
- [Datasets](/guides/evals/concepts/datasets/index.md) - Creating effective test datasets
- [Targets](/guides/evals/concepts/targets/index.md) - Agent configuration options
- [Graders](/guides/evals/concepts/graders/index.md) - Understanding grader types
- [Extractors](/guides/evals/concepts/extractors/index.md) - Extraction strategies
- [Gates](/guides/evals/concepts/gates/index.md) - Setting pass/fail criteria
