Datasets
Datasets are the test cases that define what your agent will be evaluated on. Each sample in your dataset represents one evaluation scenario.
Quick overview:
- Two formats: JSONL (flexible, powerful) or CSV (simple, spreadsheet-friendly)
- Required field: `input` - the prompt(s) to send to the agent
- Common fields: `ground_truth` (expected answer), `tags` (for filtering), `metadata` (extra info)
- Advanced fields: `agent_args` (customize agent per sample), `rubric_vars` (per-sample rubric context)
- Multi-turn support: send multiple messages in sequence using arrays
Typical workflow:
- Create a JSONL or CSV file with test cases
- Reference it in your suite YAML: `dataset: test_cases.jsonl`
- Run evaluation - each sample is tested independently
- Results show per-sample and aggregate scores
Datasets can be created in two formats: JSONL or CSV. Choose based on your team’s workflow and complexity needs.
Dataset Formats
JSONL Format
Each line is a JSON object representing one test case:
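For example (the questions and answers here are illustrative):

```jsonl
{"input": "What is the capital of France?", "ground_truth": "Paris", "tags": ["geography"]}
{"input": "What is 2 + 2?", "ground_truth": "4", "tags": ["math"]}
```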
Best for:
- Complex data structures (nested objects, arrays)
- Multi-turn conversations
- Advanced features (agent_args, rubric_vars)
- Teams comfortable with JSON/code
- Version control (clean line-by-line diffs)
CSV Format
Standard CSV with headers:
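For example, with illustrative values:

```csv
input,ground_truth
What is the capital of France?,Paris
What is 2 + 2?,4
```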
Best for:
- Simple question-answer pairs
- Teams that prefer spreadsheets (Excel, Google Sheets)
- Non-technical collaborators creating test cases
- Quick dataset creation and editing
- Easy sharing with non-developers
Field Reference
Required Fields
input
The prompt(s) to send to the agent. Can be a string or array of strings:
Single message:
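A single string sends one prompt (the content is illustrative):

```jsonl
{"input": "What is the capital of France?", "ground_truth": "Paris"}
```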
Multi-turn conversation:
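An array sends each element as a separate turn, in order (illustrative values):

```jsonl
{"input": ["My name is Alice.", "What is my name?"], "ground_truth": "Alice"}
```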
Optional Fields
ground_truth
The expected answer or content to check against. Required for most tool graders (exact_match, contains, etc.):
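For example, paired with an exact_match or contains grader (values illustrative):

```jsonl
{"input": "What is 2 + 2?", "ground_truth": "4"}
```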
metadata
Arbitrary additional data about the sample:
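The keys inside metadata are free-form; the ones below are illustrative:

```jsonl
{"input": "What is 2 + 2?", "ground_truth": "4", "metadata": {"difficulty": "easy", "source": "arithmetic-v1"}}
```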
tags
List of tags for filtering samples:
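Tag names are up to you; for example:

```jsonl
{"input": "What is 2 + 2?", "ground_truth": "4", "tags": ["math", "easy"]}
```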
Filter by tags in your suite:
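A hedged sketch; the filtering key below is a placeholder, so check the Suite YAML Reference for the exact option name:

```yaml
dataset: test_cases.jsonl
# placeholder option name; see the Suite YAML Reference for the exact key
tags: [math]
```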
agent_args
Custom arguments passed to programmatic agent creation when using `agent_script`. Allows per-sample agent customization.
JSONL:
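A sketch; the keys inside agent_args are whatever your agent factory expects, and `model` and `temperature` here are placeholders:

```jsonl
{"input": "Summarize this article in one sentence.", "agent_args": {"model": "small", "temperature": 0}}
```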
CSV:
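The same sample in CSV, with the object encoded as an escaped JSON string:

```csv
input,agent_args
Summarize this article in one sentence.,"{""model"": ""small"", ""temperature"": 0}"
```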
Your agent factory function can access these values via `sample.agent_args` to customize agent configuration.
See Targets - agent_script for details on programmatic agent creation.
rubric_vars
Variables to inject into rubric templates when using rubric graders. This allows you to provide per-sample context or examples to the LLM judge.
Example: Evaluating code quality against a reference implementation.
JSONL:
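A sketch; the variable names `reference_solution` and `language` are placeholders chosen for this example:

```jsonl
{"input": "Write a function that reverses a string.", "rubric_vars": {"reference_solution": "def reverse(s): return s[::-1]", "language": "Python"}}
```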
CSV:
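The same sample in CSV, with the object encoded as an escaped JSON string:

```csv
input,rubric_vars
Write a function that reverses a string.,"{""reference_solution"": ""def reverse(s): return s[::-1]"", ""language"": ""Python""}"
```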
In your rubric template file, reference variables with `{variable_name}`:
rubric.txt:
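Continuing the sketch above (the rubric wording is illustrative):

```text
You are evaluating a coding response written in {language}.
Compare it to this reference implementation:
{reference_solution}
Give a score of 1 if the response is functionally equivalent and clearly written, otherwise 0.
```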
When the rubric grader runs, variables are replaced with values from `rubric_vars`:
Final formatted prompt sent to LLM:
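Continuing the same sketch, the substituted prompt would read:

```text
You are evaluating a coding response written in Python.
Compare it to this reference implementation:
def reverse(s): return s[::-1]
Give a score of 1 if the response is functionally equivalent and clearly written, otherwise 0.
```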
This lets you customize evaluation criteria per sample using the same rubric template.
See Rubric Graders for details on rubric templates.
id
Sample ID is automatically assigned (0-based index) if not provided. You can override:
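For example (an integer ID is shown; whether other ID types are accepted depends on the runner):

```jsonl
{"id": 42, "input": "What is the capital of France?", "ground_truth": "Paris"}
```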
Complete Example
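A sketch pulling the fields above together (all values illustrative):

```jsonl
{"id": 1, "input": "What is the capital of France?", "ground_truth": "Paris", "tags": ["geography", "easy"], "metadata": {"source": "smoke-test"}}
{"id": 2, "input": ["My name is Alice.", "What is my name?"], "ground_truth": "Alice", "tags": ["memory", "multi-turn"], "metadata": {"source": "smoke-test"}}
```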
Dataset Best Practices
1. Clear Ground Truth
Make ground truth specific enough to grade but flexible enough to match valid responses:
Good:
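For example, keying on the essential fact (values illustrative):

```jsonl
{"input": "What year did the Apollo 11 mission land on the moon?", "ground_truth": "1969"}
```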
Too strict (might miss valid answers):
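The same sample with a full-sentence ground truth that an exact-match grader would rarely reproduce:

```jsonl
{"input": "What year did the Apollo 11 mission land on the moon?", "ground_truth": "Apollo 11 landed on the moon in the year 1969."}
```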
2. Diverse Test Cases
Include edge cases and variations:
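An illustrative mix of happy-path, edge-case, and messy-input samples:

```jsonl
{"input": "What is 15% of 80?", "ground_truth": "12", "tags": ["happy-path"]}
{"input": "What is 0 divided by 0?", "ground_truth": "undefined", "tags": ["edge-case"]}
{"input": "whats 15 percent of 80??", "ground_truth": "12", "tags": ["messy-input"]}
```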
3. Use Tags for Organization
Organize samples by type, difficulty, or feature:
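For instance (the tag names are placeholders):

```jsonl
{"input": "What is your refund policy?", "ground_truth": "30 days", "tags": ["faq", "easy"]}
{"input": "Cancel my subscription and refund my last payment.", "tags": ["account", "tool-use", "hard"]}
```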
4. Multi-Turn Conversations
Test conversational context and memory updates:
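For example, a mid-conversation correction (illustrative values):

```jsonl
{"input": ["I live in Paris.", "Actually, I just moved to Berlin.", "Where do I live now?"], "ground_truth": "Berlin", "tags": ["memory"]}
```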
Testing memory corrections: Use multi-turn inputs to test if agents properly update memory when users correct themselves. Combine with the `memory_block` extractor to verify the final memory state, not just the response.
5. No Ground Truth for LLM Judges
If using rubric graders, ground truth is optional:
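For example, a sample with no ground_truth at all (the prompt is illustrative):

```jsonl
{"input": "Write a short, friendly reply declining a meeting invitation.", "tags": ["tone"]}
```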
The LLM judge evaluates based on the rubric, not ground truth.
Loading Datasets
Datasets are automatically loaded by the runner:
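For example, in the suite YAML (the path is illustrative):

```yaml
# Path is resolved relative to this suite YAML file
dataset: datasets/test_cases.jsonl
```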
Paths are relative to the suite YAML file location.
Dataset Filtering
Limit Sample Count
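A hedged sketch; the option name below is a placeholder, so check the Suite YAML Reference for the exact key:

```yaml
dataset: test_cases.jsonl
# placeholder option name; see the Suite YAML Reference for the exact key
max_samples: 10
```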
Filter by Tags
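Similarly hedged; the filtering key below is a placeholder:

```yaml
dataset: test_cases.jsonl
# placeholder option name; see the Suite YAML Reference for the exact key
tags: [math, easy]
```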
Creating Datasets Programmatically
You can generate datasets with Python:
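A minimal sketch that writes a JSONL file with the standard json library (the generated questions are arbitrary):

```python
import json

# Generate simple arithmetic samples, one dict per test case
samples = [
    {"input": f"What is {a} + {b}?", "ground_truth": str(a + b), "tags": ["math"]}
    for a in range(1, 4)
    for b in range(1, 4)
]

# Write one JSON object per line
with open("test_cases.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```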
Dataset Format Validation
The runner validates:
- Each line is valid JSON
- Required fields are present
- Field types are correct
Validation errors will be reported with line numbers.
Examples by Use Case
Question Answering
JSONL:
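Illustrative samples:

```jsonl
{"input": "What is the capital of Japan?", "ground_truth": "Tokyo", "tags": ["qa"]}
{"input": "Who wrote Hamlet?", "ground_truth": "Shakespeare", "tags": ["qa"]}
```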
CSV:
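The same samples as CSV:

```csv
input,ground_truth
What is the capital of Japan?,Tokyo
Who wrote Hamlet?,Shakespeare
```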
Tool Usage Testing
JSONL:
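Illustrative samples; the tool names get_weather and make_reservation are placeholders for whatever tools your agent exposes:

```jsonl
{"input": "What's the weather in Paris right now?", "ground_truth": "get_weather", "tags": ["tools"]}
{"input": "Book a table for two at 7pm tomorrow.", "ground_truth": "make_reservation", "tags": ["tools"]}
```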
CSV:
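The same samples as CSV:

```csv
input,ground_truth
What's the weather in Paris right now?,get_weather
Book a table for two at 7pm tomorrow.,make_reservation
```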
Ground truth = expected tool name.
Memory Testing (Multi-turn)
JSONL:
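An illustrative sample:

```jsonl
{"input": ["My favorite color is blue.", "Actually, make that green.", "What is my favorite color?"], "ground_truth": "green", "tags": ["memory"]}
```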
CSV (using JSON array strings):
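The same sample, with the turns encoded as an escaped JSON array string:

```csv
input,ground_truth
"[""My favorite color is blue."", ""Actually, make that green."", ""What is my favorite color?""]",green
```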
Code Generation
JSONL:
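An illustrative sample; no ground_truth is needed because a rubric grader judges the code:

```jsonl
{"input": "Write a Python function that checks whether a string is a palindrome.", "tags": ["code"]}
```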
CSV:
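The same sample as CSV:

```csv
input
Write a Python function that checks whether a string is a palindrome.
```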
Use rubric graders to evaluate code quality.
CSV Advanced Features
CSV supports all the same features as JSONL by encoding complex data as JSON strings in cells:
Multi-turn conversations (requires escaped JSON array string):
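An illustrative row:

```csv
input,ground_truth
"[""My name is Alice."", ""What is my name?""]",Alice
```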
Agent arguments (requires escaped JSON object string):
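An illustrative row (the agent_args keys are placeholders):

```csv
input,agent_args
Summarize this article in one sentence.,"{""model"": ""small""}"
```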
Rubric variables (requires escaped JSON object string):
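An illustrative row (the variable name is a placeholder):

```csv
input,rubric_vars
Write a sorting function.,"{""language"": ""Python""}"
```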
Note: Complex data structures require JSON encoding in CSV. If you’re frequently using these advanced features, JSONL may be easier to read and maintain.
Next Steps
- Suite YAML Reference - Complete configuration options including filtering
- Graders - How to evaluate agent responses
- Multi-Turn Conversations - Testing conversational flows