# Multi-Turn Conversations

Multi-turn conversations allow you to test how agents handle context across multiple exchanges - a key capability for stateful agents.
## Why Use Multi-Turn?

Multi-turn conversations enable testing that single-turn prompts cannot:
- Memory storage: Verify agents persist information to memory blocks across turns
- Tool call sequences: Test multi-step workflows (e.g., search → analyze → summarize)
- Context retention: Ensure agents remember details from earlier in the conversation
- State evolution: Track how agent state changes across interactions
- Conversational coherence: Test if agents maintain context appropriately
This is essential for stateful agents where behavior depends on conversation history.
## Single vs Multi-Turn Format

### Single-Turn (Default)

Most evaluations use a single prompt:

```json
{ "input": "What is the capital of France?", "ground_truth": "Paris" }
```

The agent receives one message and responds. Single-turn conversations are useful for simpler agents and for testing next-step behavior.
### Multi-Turn

For testing conversational memory, use an array of messages:

```json
{
  "input": [
    "My name is Alice",
    "What's my name?"
  ],
  "ground_truth": "Alice"
}
```

The agent receives multiple messages in sequence:

- Turn 1: "My name is Alice"
- Turn 2: "What's my name?"
See the built-in extractors for details on how the agent's response is pulled out of a multi-turn conversation for grading.
## How It Works

When you provide an array for `input`, the framework:

1. Sends the first message to the agent
2. Waits for the agent's response
3. Sends the second message
4. Continues until all messages are sent
5. Extracts and grades the agent's response using the specified extractor and grader
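The steps above can be sketched as a simple loop. Note that `agent.send`, and the callable shapes of `extractor` and `grader`, are hypothetical stand-ins for whatever interfaces your framework actually exposes:

```python
def run_multi_turn(agent, messages, extractor, grader, ground_truth):
    """Drive one multi-turn sample: send each message in order, then grade.

    `agent.send` is a hypothetical client method that sends one user
    message and blocks until the agent's response arrives.
    """
    responses = []
    for message in messages:                 # steps 1-4: send turns in sequence
        responses.append(agent.send(message))
    extracted = extractor(agent, responses)  # step 5: pull out the text to grade
    return grader(extracted, ground_truth)
```

The key point is that every turn runs against the same agent instance, so state accumulated in earlier turns is visible to later ones.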
## Use Cases

### Testing Memory Persistence

```json
{
  "input": [
    "I live in Paris",
    "Where do I live?"
  ],
  "ground_truth": "Paris"
}
```

Tests whether the agent stores information correctly using the `memory_block` extractor.
### Testing Tool Call Sequences

```json
{
  "input": [
    "Search for pandas",
    "What did you find about their diet?"
  ],
  "ground_truth": "bamboo"
}
```

Verifies the agent calls tools in the right order and uses their results appropriately.
### Testing Context Retention

```json
{
  "input": [
    "My favorite color is blue",
    "What color do I prefer?"
  ],
  "ground_truth": "blue"
}
```

Ensures the agent recalls details from earlier in the conversation.
### Testing Long-Term Memory

```json
{
  "input": [
    "My name is Alice",
    "Tell me a joke",
    "What's my name again?"
  ],
  "ground_truth": "Alice"
}
```

Checks whether the agent remembers information even after intervening exchanges.
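On disk, samples like the ones above live one JSON object per line in the dataset file. A minimal loader that accepts both single-turn strings and multi-turn arrays might look like this (the normalization to a list of turns is an assumption about how you'd want to consume the data, not the framework's actual loader):

```python
import json

def load_samples(path):
    """Read a JSONL dataset where each non-empty line is one sample.

    `input` may be a single string (single-turn) or a list of strings
    (multi-turn); normalize both to a list of turns for uniform handling.
    """
    samples = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            sample = json.loads(line)
            turns = sample["input"]
            sample["input"] = [turns] if isinstance(turns, str) else turns
            samples.append(sample)
    return samples
```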
## Example Configuration

```yaml
name: multi-turn-test
dataset: conversations.jsonl

target:
  kind: agent
  agent_file: agent.af
  base_url: http://localhost:8283

graders:
  recall:
    kind: tool
    function: contains
    extractor: last_assistant

gate:
  metric_key: recall
  op: gte
  value: 0.8
```

The grader evaluates the agent's final response (after all turns).
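The `gate` stanza reads as a threshold check on an aggregated metric. Here is a sketch of that check; only `gte` appears in the config above, so the mean aggregation and any other operators are assumptions about the framework's behavior:

```python
def check_gate(scores, op, value):
    """Apply a gate to per-sample scores for one metric.

    Aggregates by mean, then compares using the gate operator.
    `gte` matches the config above; `lte` is an assumed sibling.
    """
    avg = sum(scores) / len(scores)
    if op == "gte":
        return avg >= value
    if op == "lte":
        return avg <= value
    raise ValueError(f"unknown gate op: {op}")
```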
## Testing Both Response and Memory

Multi-turn evaluations become especially powerful when combined with the `memory_block` extractor:

```yaml
graders:
  response_accuracy:
    kind: tool
    function: contains
    extractor: last_assistant
  memory_storage:
    kind: tool
    function: contains
    extractor: memory_block
    extractor_config:
      block_label: human
```

This tests two things:

- Did the agent respond correctly? (using conversation context)
- Did the agent persist the information? (to its memory blocks)
An agent might pass the first test by keeping information in working memory, but fail the second by not properly storing it for long-term recall.
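To make that distinction concrete, here is a sketch of a `contains`-style check applied to both extraction targets. The case-insensitive substring semantics and the shape of the agent state are assumptions for illustration:

```python
def contains_score(text, ground_truth):
    """Assumed semantics of the `contains` grader: substring match."""
    return 1.0 if ground_truth.lower() in text.lower() else 0.0

# An agent that answered from working memory but never wrote to its
# "human" memory block (the state shape below is hypothetical):
final_response = "Your name is Alice."
memory_blocks = {"human": ""}  # nothing was persisted

response_accuracy = contains_score(final_response, "Alice")       # 1.0
memory_storage = contains_score(memory_blocks["human"], "Alice")  # 0.0
```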
## Context vs Persistence

Consider this result:

```
Results by metric:
  response_accuracy - Avg: 1.00, Pass: 100.0%
  memory_storage    - Avg: 0.00, Pass: 0.0%
```

The agent answered correctly (100%) but didn't store anything in memory (0%). This reveals important agent behavior:
- Working memory: Agent kept information in conversation context
- Persistent memory: Agent didn’t update its memory blocks
For short conversations, working memory is sufficient. For long-term interactions, persistent memory is crucial.
## Complete Example

See `examples/multi-turn-memory/` for a working example that demonstrates:
- Multi-turn conversation format
- Dual metric evaluation (response + memory)
- The difference between context-based recall and true persistence
## Best Practices

### 1. Keep Turns Focused

Each turn should test one aspect of memory or context:
```json
{
  "input": [
    "I'm allergic to peanuts",
    "Can I eat this cookie?"
  ],
  "ground_truth": "peanut"
}
```

### 2. Test Realistic Scenarios

Design conversations that mirror real user interactions:
```json
{
  "input": [
    "Set a reminder for tomorrow at 2pm",
    "What reminders do I have?"
  ],
  "ground_truth": "2pm"
}
```

### 3. Use Tags for Organization

Tag multi-turn samples to distinguish them:
```json
{
  "input": [
    "Hello",
    "How are you?"
  ],
  "tags": [
    "multi-turn",
    "greeting"
  ]
}
```

### 4. Test Memory Limits

See how far back agents can recall:
```json
{
  "input": [
    "My name is Alice",
    "message 2",
    "message 3",
    "message 4",
    "What's my name?"
  ],
  "ground_truth": "Alice"
}
```

### 5. Combine with Memory Extractors

Always verify both the response and the agent's internal state for memory tests.
## Limitations

### Turn Count

Very long conversations may exceed context windows. Monitor token usage for conversations with many turns.
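A rough way to keep an eye on this is to estimate token usage before running. The 4-characters-per-token ratio below is a common rule of thumb for English text, not an exact tokenizer:

```python
def estimate_turn_tokens(messages, chars_per_token=4):
    """Very rough token estimate for a multi-turn sample's inputs.

    Counts only the user turns; agent responses, system prompts, and
    tool output also consume context, so treat this as a lower bound.
    """
    return sum(len(m) for m in messages) // chars_per_token
```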
### State Isolation

Each sample starts with a fresh agent (or a fresh conversation if using `agent_id`). Multi-turn tests memory within a single conversation, not across separate conversations.
### Extraction

Most extractors work on the final state. If you need to check intermediate turns, consider using custom extractors.
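For instance, a custom extractor might capture every assistant message rather than just the last one, so a grader can check intermediate turns. The signature and transcript shape below are hypothetical; match your framework's actual extractor interface:

```python
def all_assistant_turns(transcript):
    """Hypothetical custom extractor: join every assistant message.

    Assumes `transcript` is a list of {"role": ..., "content": ...}
    dicts covering the whole conversation, not just the final turn.
    """
    return "\n".join(
        m["content"] for m in transcript if m["role"] == "assistant"
    )
```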
## Next Steps

- Built-in Extractors - using the `memory_block` extractor
- Custom Extractors - build extractors for complex scenarios
- Multi-Metric Evaluation - combine multiple checks