---
title: Multi-turn conversations | Letta Docs
description: Test agent behavior across multiple conversation turns with evaluation datasets and graders.
---

Multi-turn conversations allow you to test how agents handle context across multiple exchanges.

This is essential for stateful agents where behavior depends on conversation history.

## Why Use Multi-Turn?

Multi-turn conversations enable testing that single-turn prompts cannot:

- **Memory storage**: Verify agents persist information to memory blocks
- **Tool call sequences**: Test multi-step workflows
- **Context retention**: Ensure agents remember details from earlier
- **State evolution**: Track how agent state changes across interactions
- **Conversational coherence**: Test if agents maintain context appropriately

## Format

### Single-Turn (Default)

```
{
  "input": "What is the capital of France?",
  "ground_truth": "Paris"
}
```

### Multi-Turn

```
{
  "input": [
    "My name is Alice",
    "What's my name?"
  ],
  "ground_truth": "Alice"
}
```

The agent processes each input in sequence, with state carrying over between turns.

## Per-Turn Evaluation

When both `input` and `ground_truth` are lists of the same length, Letta Evals automatically switches to per-turn evaluation mode. Each turn is graded independently against its corresponding ground truth.

```
{
  "input": [
    "What is the capital of France?",
    "What is the capital of Germany?",
    "What is the capital of Italy?"
  ],
  "ground_truth": ["Paris", "Berlin", "Rome"]
}
```

**Key differences from standard multi-turn:**

| Feature        | Standard Multi-turn | Per-Turn Evaluation             |
| -------------- | ------------------- | ------------------------------- |
| `ground_truth` | Single string       | List (one per turn)             |
| Evaluation     | Final output only   | Each turn independently         |
| Score          | Binary (pass/fail)  | Proportional (avg across turns) |

**When to use per-turn evaluation:**

- Each step in a conversation needs to be correct
- You want to measure partial success (e.g., 2/3 questions answered correctly)
- Testing sequential reasoning where intermediate answers matter
- Evaluating Q\&A agents across multiple questions

See the [multiturn-per-turn-grading example](https://github.com/letta-ai/letta-evals/tree/main/examples/multiturn-per-turn-grading) for a complete implementation.

## Example 1: Memory Recall Testing

Test if the agent remembers information across turns:

```
{
  "input": [
    "Remember that my favorite color is blue",
    "What's my favorite color?"
  ],
  "ground_truth": "blue"
}
```

Suite configuration:

```
graders:
  response_check:
    kind: tool
    function: contains
    extractor: last_assistant # Check the agent's response
```

## Example 2: Memory Correction Testing

Test if the agent correctly updates memory when users correct themselves:

```
{
  "input": [
    "Please remember that I like bananas.",
    "Actually, sorry, I meant I like apples."
  ],
  "ground_truth": "apples"
}
```

Suite configuration:

```
graders:
  memory_check:
    kind: tool
    function: contains
    extractor: memory_block
    extractor_config:
      block_label: human # Check the actual memory block, not just the response
```

**Key difference:** The `memory_block` extractor verifies the agent actually stored the corrected information in memory, not just that it responded correctly. This tests real memory persistence.

## When to Test Memory Blocks vs. Responses

**Use `last_assistant` or `all_assistant` extractors when:**

- Testing what the agent says in conversation
- Verifying response content and phrasing
- Checking conversational coherence

**Use `memory_block` extractor when:**

- Verifying information was actually stored in memory
- Testing memory updates and corrections
- Validating persistent state changes
- Ensuring the agent’s internal state is correct

See the [multiturn-memory-block-extractor example](https://github.com/letta-ai/letta-evals/tree/main/examples/multiturn-memory-block-extractor) for a complete working implementation.

## Next Steps

- [Datasets](/guides/evals/concepts/datasets/index.md) - Creating test datasets
- [Extractors](/guides/evals/concepts/extractors/index.md) - Extracting from trajectories
- [Targets](/guides/evals/concepts/targets/index.md) - Agent lifecycle and testing behavior
