Datasets

Datasets are the test cases that define what your agent will be evaluated on. Each sample in your dataset represents one evaluation scenario.

Quick overview:

  • Two formats: JSONL (flexible, powerful) or CSV (simple, spreadsheet-friendly)
  • Required field: input - the prompt(s) to send to the agent
  • Common fields: ground_truth (expected answer), tags (for filtering), metadata (extra info)
  • Advanced fields: agent_args (customize agent per sample), rubric_vars (per-sample rubric context)
  • Multi-turn support: Send multiple messages in sequence using arrays

Typical workflow:

  1. Create a JSONL or CSV file with test cases
  2. Reference it in your suite YAML: dataset: test_cases.jsonl
  3. Run evaluation - each sample is tested independently
  4. Results show per-sample and aggregate scores

Datasets can be created in two formats: JSONL or CSV. Choose based on your team’s workflow and complexity needs.

Dataset Formats

JSONL Format

Each line is a JSON object representing one test case:

1{"input": "What's the capital of France?", "ground_truth": "Paris"}
2{"input": "Calculate 2+2", "ground_truth": "4"}
3{"input": "What color is the sky?", "ground_truth": "blue"}

Best for:

  • Complex data structures (nested objects, arrays)
  • Multi-turn conversations
  • Advanced features (agent_args, rubric_vars)
  • Teams comfortable with JSON/code
  • Version control (clean line-by-line diffs)

CSV Format

Standard CSV with headers:

input,ground_truth
"What's the capital of France?","Paris"
"Calculate 2+2","4"
"What color is the sky?","blue"

Best for:

  • Simple question-answer pairs
  • Teams that prefer spreadsheets (Excel, Google Sheets)
  • Non-technical collaborators creating test cases
  • Quick dataset creation and editing
  • Easy sharing with non-developers

Quick Reference

| Field        | Required | Type             | Purpose                              |
|--------------|----------|------------------|--------------------------------------|
| input        | Yes      | string or array  | Prompt(s) to send to agent           |
| ground_truth | No       | string           | Expected answer (for tool graders)   |
| tags         | No       | array of strings | For filtering samples                |
| agent_args   | No       | object           | Per-sample agent customization       |
| rubric_vars  | No       | object           | Per-sample rubric variables          |
| metadata     | No       | object           | Arbitrary extra data                 |
| id           | No       | integer          | Sample ID (auto-assigned if omitted) |

Field Reference

Required Fields

input

The prompt(s) to send to the agent. Can be a string or array of strings:

Single message:

1{"input": "Hello, who are you?"}

Multi-turn conversation:

1{"input": ["Hello", "What's your name?", "Tell me about yourself"]}

Optional Fields

ground_truth

The expected answer or content to check against. Required for most tool graders (exact_match, contains, etc.):

1{"input": "What is 2+2?", "ground_truth": "4"}

metadata

Arbitrary additional data about the sample:

{
  "input": "What is photosynthesis?",
  "ground_truth": "process where plants convert light into energy",
  "metadata": {
    "category": "biology",
    "difficulty": "medium"
  }
}

tags

List of tags for filtering samples:

1{"input": "Solve x^2 = 16", "ground_truth": "4", "tags": ["math", "algebra"]}

Filter by tags in your suite:

sample_tags: [math]  # Only samples tagged "math" will be evaluated

agent_args

Custom arguments passed to programmatic agent creation when using agent_script. Allows per-sample agent customization.

JSONL:

{
  "input": "What items do we have?",
  "agent_args": {
    "item": {"sku": "SKU-123", "name": "Widget A", "price": 19.99}
  }
}

CSV:

input,agent_args
"What items do we have?","{""item"": {""sku"": ""SKU-123"", ""name"": ""Widget A"", ""price"": 19.99}}"

Your agent factory function can access these values via sample.agent_args to customize agent configuration.
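
For illustration, here is a minimal sketch of a factory that reads these values to build a per-sample persona. The function name, signature, and return value are assumptions, not the actual agent_script interface:

# Illustrative only: the factory name, signature, and return value are
# assumptions; see Targets - agent_script for the real interface.
def create_agent(sample):
    item = (sample.agent_args or {}).get("item", {})
    persona = (
        f"You manage inventory. Current item: {item.get('name', 'unknown')} "
        f"(SKU {item.get('sku', 'n/a')}, ${item.get('price', 0)})."
    )
    # Construct and return your agent (or its configuration) from this persona,
    # following the agent_script documentation.
    return {"persona": persona}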

See Targets - agent_script for details on programmatic agent creation.

rubric_vars

Variables to inject into rubric templates when using rubric graders. This allows you to provide per-sample context or examples to the LLM judge.

Example: Evaluating code quality against a reference implementation.

JSONL:

1{"input": "Write a function to calculate fibonacci numbers", "rubric_vars": {"reference_code": "def fib(n):\n if n <= 1: return n\n return fib(n-1) + fib(n-2)", "required_features": "recursion, base case"}}

CSV:

input,rubric_vars
"Write a function to calculate fibonacci numbers","{""reference_code"": ""def fib(n):\n    if n <= 1: return n\n    return fib(n-1) + fib(n-2)"", ""required_features"": ""recursion, base case""}"

In your rubric template file, reference variables with {variable_name}:

rubric.txt:

Evaluate the submitted code against this reference implementation:
{reference_code}
Required features: {required_features}
Score on correctness (0.6) and code quality (0.4).

When the rubric grader runs, variables are replaced with values from rubric_vars:

Final formatted prompt sent to the LLM judge:

Evaluate the submitted code against this reference implementation:
def fib(n):
    if n <= 1: return n
    return fib(n-1) + fib(n-2)
Required features: recursion, base case
Score on correctness (0.6) and code quality (0.4).

This lets you customize evaluation criteria per sample using the same rubric template.
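
Conceptually, this is plain template formatting. A quick sketch of the idea in Python (illustrative only, not the grader's actual implementation):

# Illustrative only: shows how {variable_name} placeholders are filled
# from a sample's rubric_vars.
rubric_vars = {
    "reference_code": "def fib(n):\n    if n <= 1: return n\n    return fib(n-1) + fib(n-2)",
    "required_features": "recursion, base case",
}

with open("rubric.txt") as f:
    template = f.read()

judge_prompt = template.format(**rubric_vars)
print(judge_prompt)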

See Rubric Graders for details on rubric templates.

id

Sample IDs are automatically assigned (0-based index) if not provided. You can override this explicitly:

1{"id": 42, "input": "Test case 42"}

Complete Example

1{"id": 1, "input": "What is the capital of France?", "ground_truth": "Paris", "tags": ["geography", "easy"], "metadata": {"region": "Europe"}}
2{"id": 2, "input": "Calculate the square root of 144", "ground_truth": "12", "tags": ["math", "medium"]}
3{"id": 3, "input": ["Hello", "What can you help me with?"], "tags": ["conversation"]}

Dataset Best Practices

1. Clear Ground Truth

Make ground truth specific enough to grade but flexible enough to match valid responses:

Good:

1{"input": "What's the largest planet?", "ground_truth": "Jupiter"}

Too strict (might miss valid answers):

1{"input": "What's the largest planet?", "ground_truth": "Jupiter is the largest planet in our solar system."}

2. Diverse Test Cases

Include edge cases and variations:

1{"input": "What is 2+2?", "ground_truth": "4", "tags": ["math", "easy"]}
2{"input": "What is 0.1 + 0.2?", "ground_truth": "0.3", "tags": ["math", "floating_point"]}
3{"input": "What is 999999999 + 1?", "ground_truth": "1000000000", "tags": ["math", "large_numbers"]}

3. Use Tags for Organization

Organize samples by type, difficulty, or feature:

1{"tags": ["tool_usage", "search"]}
2{"tags": ["memory", "recall"]}
3{"tags": ["reasoning", "multi_step"]}

4. Multi-Turn Conversations

Test conversational context and memory updates:

1{"input": ["My name is Alice", "What's my name?"], "ground_truth": "Alice", "tags": ["memory", "recall"]}
2{"input": ["Please remember that I like bananas.", "Actually, sorry, I meant I like apples."], "ground_truth": "apples", "tags": ["memory", "correction"]}
3{"input": ["I work at Google", "Update my workplace to Microsoft", "Where do I work?"], "ground_truth": "Microsoft", "tags": ["memory", "multi_step"]}

Testing memory corrections: Use multi-turn inputs to test if agents properly update memory when users correct themselves. Combine with the memory_block extractor to verify the final memory state, not just the response.

5. No Ground Truth for LLM Judges

If using rubric graders, ground truth is optional:

1{"input": "Write a creative story about a robot", "tags": ["creative"]}
2{"input": "Explain quantum computing simply", "tags": ["explanation"]}

The LLM judge evaluates based on the rubric, not ground truth.

Loading Datasets

Datasets are automatically loaded by the runner:

dataset: path/to/dataset.jsonl  # Path to your test cases (JSONL or CSV)

Paths are relative to the suite YAML file location.
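
If you want to double-check which file the runner will pick up, you can resolve the path the same way. A small sketch (file names here are made up):

from pathlib import Path

suite_yaml = Path("suites/qa_suite.yaml")   # hypothetical suite file
dataset_value = "data/test_cases.jsonl"     # the `dataset:` value from that YAML

# Relative dataset paths resolve against the suite file's directory
print((suite_yaml.parent / dataset_value).resolve())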

Dataset Filtering

Limit Sample Count

max_samples: 10  # Only evaluate first 10 samples (useful for testing)

Filter by Tags

sample_tags: [math, medium]  # Only samples with ALL these tags

Creating Datasets Programmatically

You can generate datasets with Python:

import json

samples = []
for i in range(100):
    samples.append({
        "input": f"What is {i} + {i}?",
        "ground_truth": str(i + i),
        "tags": ["math", "addition"]
    })

with open("dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

Dataset Format Validation

The runner validates:

  • Each line is valid JSON
  • Required fields are present
  • Field types are correct

Validation errors will be reported with line numbers.
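
You can also run a quick sanity check locally before launching a full evaluation. A minimal sketch (this is a standalone pre-check, not the runner's validator):

import json

def check_jsonl(path):
    """Report obvious problems in a JSONL dataset before running an eval."""
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                sample = json.loads(line)
            except json.JSONDecodeError as err:
                print(f"line {lineno}: invalid JSON ({err})")
                continue
            if "input" not in sample:
                print(f"line {lineno}: missing required field 'input'")
            elif not isinstance(sample["input"], (str, list)):
                print(f"line {lineno}: 'input' must be a string or an array of strings")

check_jsonl("dataset.jsonl")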

Examples by Use Case

Question Answering

JSONL:

1{"input": "What is the capital of France?", "ground_truth": "Paris"}
2{"input": "Who wrote Romeo and Juliet?", "ground_truth": "Shakespeare"}

CSV:

input,ground_truth
"What is the capital of France?","Paris"
"Who wrote Romeo and Juliet?","Shakespeare"

Tool Usage Testing

JSONL:

1{"input": "Search for information about pandas", "ground_truth": "search"}
2{"input": "Calculate 15 * 23", "ground_truth": "calculator"}

CSV:

input,ground_truth
"Search for information about pandas","search"
"Calculate 15 * 23","calculator"

Ground truth = expected tool name.

Memory Testing (Multi-turn)

JSONL:

1{"input": ["Remember that my favorite color is blue", "What's my favorite color?"], "ground_truth": "blue"}
2{"input": ["I live in Tokyo", "Where do I live?"], "ground_truth": "Tokyo"}

CSV (using JSON array strings):

input,ground_truth
"[""Remember that my favorite color is blue"", ""What's my favorite color?""]","blue"
"[""I live in Tokyo"", ""Where do I live?""]","Tokyo"

Code Generation

JSONL:

1{"input": "Write a function to reverse a string in Python"}
2{"input": "Create a SQL query to find users older than 21"}

CSV:

input
"Write a function to reverse a string in Python"
"Create a SQL query to find users older than 21"

Use rubric graders to evaluate code quality.

CSV Advanced Features

CSV supports all the same features as JSONL by encoding complex data as JSON strings in cells:

Multi-turn conversations (requires escaped JSON array string):

input,ground_truth
"[""Hello"", ""What's your name?""]","Alice"

Agent arguments (requires escaped JSON object string):

input,agent_args
"What items do we have?","{""initial_inventory"": [""apple"", ""banana""]}"

Rubric variables (requires escaped JSON object string):

input,rubric_vars
"Write a story","{""max_length"": 500, ""genre"": ""sci-fi""}"

Note: Complex data structures require JSON encoding in CSV. If you’re frequently using these advanced features, JSONL may be easier to read and maintain.
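
If you generate CSV datasets from code, letting Python's csv module handle the quoting is easier than escaping quotes by hand. A sketch (field names are taken from the examples above):

import csv
import json

rows = [
    {"input": ["Hello", "What's your name?"], "ground_truth": "Alice"},
    {"input": "Write a story", "rubric_vars": {"max_length": 500, "genre": "sci-fi"}},
]

with open("dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "ground_truth", "rubric_vars"])
    writer.writeheader()
    for row in rows:
        writer.writerow({
            # JSON-encode lists/objects so each lands in a single CSV cell;
            # the csv module produces the doubled quotes shown above.
            key: json.dumps(value) if isinstance(value, (list, dict)) else value
            for key, value in row.items()
        })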

Next Steps