Multi-turn conversations

Test agent behavior across multiple conversation turns with evaluation datasets and graders.

Multi-turn conversations let you test how agents handle context across multiple exchanges, enabling checks that single-turn prompts cannot:

  • Memory storage: Verify agents persist information to memory blocks
  • Tool call sequences: Test multi-step workflows
  • Context retention: Ensure agents remember details from earlier
  • State evolution: Track how agent state changes across interactions
  • Conversational coherence: Test if agents maintain context appropriately
A single-turn case uses a string input:

{
  "input": "What is the capital of France?",
  "ground_truth": "Paris"
}

To make a case multi-turn, provide input as a list of messages:

{
  "input": [
    "My name is Alice",
    "What's my name?"
  ],
  "ground_truth": "Alice"
}

The agent processes each input in sequence, with state carrying over between turns.

When both input and ground_truth are lists of the same length, Letta Evals automatically switches to per-turn evaluation mode. Each turn is graded independently against its corresponding ground truth.

{
  "input": [
    "What is the capital of France?",
    "What is the capital of Germany?",
    "What is the capital of Italy?"
  ],
  "ground_truth": ["Paris", "Berlin", "Rome"]
}
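
Per-turn mode requires no extra configuration; it is triggered by the list-valued input and ground_truth. As a rough sketch, assuming a contains-style grader like the ones shown later in this guide is applied to each turn's response (the grader name capital_check is illustrative), the suite configuration could look like:

graders:
  capital_check:
    kind: tool
    function: contains
    extractor: last_assistant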

Key differences from standard multi-turn:

Feature         Standard Multi-turn     Per-Turn Evaluation
ground_truth    Single string           List (one per turn)
Evaluation      Final output only       Each turn independently
Score           Binary (pass/fail)      Proportional (avg across turns)

When to use per-turn evaluation:

  • Each step in a conversation needs to be correct
  • You want to measure partial success (e.g., 2/3 questions answered correctly)
  • Testing sequential reasoning where intermediate answers matter
  • Evaluating Q&A agents across multiple questions

See the multiturn-per-turn-grading example for a complete implementation.

Test if the agent remembers information across turns:

{
  "input": [
    "Remember that my favorite color is blue",
    "What's my favorite color?"
  ],
  "ground_truth": "blue"
}

Suite configuration:

graders:
  response_check:
    kind: tool
    function: contains
    extractor: last_assistant  # Check the agent's response
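
If the remembered detail might appear in an earlier reply rather than only the final one, the all_assistant extractor (described below) is an alternative; this sketch assumes it extracts every assistant response rather than just the last:

graders:
  response_check:
    kind: tool
    function: contains
    extractor: all_assistant  # Check every assistant response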

Test if the agent correctly updates memory when users correct themselves:

{
  "input": [
    "Please remember that I like bananas.",
    "Actually, sorry, I meant I like apples."
  ],
  "ground_truth": "apples"
}

Suite configuration:

graders:
  memory_check:
    kind: tool
    function: contains
    extractor: memory_block
    extractor_config:
      block_label: human  # Check the actual memory block, not just the response

Use last_assistant or all_assistant extractors when:

  • Testing what the agent says in conversation
  • Verifying response content and phrasing
  • Checking conversational coherence

Use memory_block extractor when:

  • Verifying information was actually stored in memory
  • Testing memory updates and corrections
  • Validating persistent state changes
  • Ensuring the agent’s internal state is correct
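
Both kinds of checks can also be combined for the same dataset, assuming the suite format supports multiple named graders (which the graders map suggests). This is a minimal sketch that reuses only the fields shown above, with illustrative grader names:

graders:
  response_check:
    kind: tool
    function: contains
    extractor: last_assistant   # What the agent said
  memory_check:
    kind: tool
    function: contains
    extractor: memory_block
    extractor_config:
      block_label: human        # What the agent stored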

See the multiturn-memory-block-extractor example for a complete working implementation.