Multi-turn conversations

Test agent behavior across multiple conversation turns with evaluation datasets and graders.

Multi-turn conversations let you test how agents handle context across multiple exchanges, enabling checks that single-turn prompts cannot:

  • Memory storage: Verify agents persist information to memory blocks
  • Tool call sequences: Test multi-step workflows
  • Context retention: Ensure agents remember details from earlier
  • State evolution: Track how agent state changes across interactions
  • Conversational coherence: Test if agents maintain context appropriately
A single-turn case uses a string input:

{
  "input": "What is the capital of France?",
  "ground_truth": "Paris"
}

To make a case multi-turn, provide input as a list of messages:

{
  "input": [
    "My name is Alice",
    "What's my name?"
  ],
  "ground_truth": "Alice"
}

The agent processes each input in sequence, with state carrying over between turns.

When both input and ground_truth are lists of the same length, Letta Evals automatically switches to per-turn evaluation mode. Each turn is graded independently against its corresponding ground truth.

{
  "input": [
    "What is the capital of France?",
    "What is the capital of Germany?",
    "What is the capital of Italy?"
  ],
  "ground_truth": ["Paris", "Berlin", "Rome"]
}
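
Per-turn mode requires no extra configuration; it is triggered by the list-valued input and ground_truth. As a rough sketch, assuming a contains-style grader like the ones shown later in this guide is applied to each turn's response (the grader name capital_check is illustrative), the suite configuration could look like:

graders:
  capital_check:
    kind: tool
    function: contains
    extractor: last_assistant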

Key differences from standard multi-turn:

Feature         Standard Multi-turn     Per-Turn Evaluation
ground_truth    Single string           List (one per turn)
Evaluation      Final output only       Each turn independently
Score           Binary (pass/fail)      Proportional (avg across turns)

When to use per-turn evaluation:

  • Each step in a conversation needs to be correct
  • You want to measure partial success (e.g., 2/3 questions answered correctly)
  • Testing sequential reasoning where intermediate answers matter
  • Evaluating Q&A agents across multiple questions

See the multiturn-per-turn-grading example for a complete implementation.

Test if the agent remembers information across turns:

{
  "input": [
    "Remember that my favorite color is blue",
    "What's my favorite color?"
  ],
  "ground_truth": "blue"
}

Suite configuration:

graders:
  response_check:
    kind: tool
    function: contains
    extractor: last_assistant  # Check the agent's response
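
If the remembered detail might appear in an earlier reply rather than only the final one, the all_assistant extractor (described below) is an alternative; this sketch assumes it extracts every assistant response rather than just the last:

graders:
  response_check:
    kind: tool
    function: contains
    extractor: all_assistant  # Check every assistant response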

Test if the agent correctly updates memory when users correct themselves:

{
  "input": [
    "Please remember that I like bananas.",
    "Actually, sorry, I meant I like apples."
  ],
  "ground_truth": "apples"
}

Suite configuration:

graders:
  memory_check:
    kind: tool
    function: contains
    extractor: memory_block
    extractor_config:
      block_label: human  # Check the actual memory block, not just the response

Use last_assistant or all_assistant extractors when:

  • Testing what the agent says in conversation
  • Verifying response content and phrasing
  • Checking conversational coherence

Use memory_block extractor when:

  • Verifying information was actually stored in memory
  • Testing memory updates and corrections
  • Validating persistent state changes
  • Ensuring the agent’s internal state is correct
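
Both kinds of checks can also be combined for the same dataset, assuming the suite format supports multiple named graders (which the graders map suggests). This is a minimal sketch that reuses only the fields shown above, with illustrative grader names:

graders:
  response_check:
    kind: tool
    function: contains
    extractor: last_assistant   # What the agent said
  memory_check:
    kind: tool
    function: contains
    extractor: memory_block
    extractor_config:
      block_label: human        # What the agent stored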

See the multiturn-memory-block-extractor example for a complete working implementation.