---
title: Datasets | Letta Docs
description: Create and manage evaluation datasets with test cases for systematic agent testing.
---

**Datasets** are the test cases that define what your agent will be evaluated on. Each sample in your dataset represents one evaluation scenario.

**Quick overview:**

- **Two formats**: JSONL (flexible, powerful) or CSV (simple, spreadsheet-friendly)
- **Required field**: `input` - the prompt(s) to send to the agent
- **Common fields**: `ground_truth` (expected answer), `tags` (for filtering), `metadata` (extra info)
- **Advanced fields**: `agent_args` (customize agent per sample), `rubric_vars` (per-sample rubric context)
- **Multi-turn support**: Send multiple messages in sequence using arrays

**Typical workflow:**

1. Create a JSONL or CSV file with test cases
2. Reference it in your suite YAML: `dataset: test_cases.jsonl`
3. Run evaluation - each sample is tested independently
4. Results show per-sample and aggregate scores

Datasets can be created in two formats: **JSONL** or **CSV**. Choose based on your team’s workflow and complexity needs.

## Dataset Formats

### JSONL Format

Each line is a JSON object representing one test case:

```
{"input": "What's the capital of France?", "ground_truth": "Paris"}
{"input": "Calculate 2+2", "ground_truth": "4"}
{"input": "What color is the sky?", "ground_truth": "blue"}
```

**Best for:**

- Complex data structures (nested objects, arrays)
- Multi-turn conversations
- Advanced features (agent\_args, rubric\_vars)
- Teams comfortable with JSON/code
- Version control (clean line-by-line diffs)
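To inspect a JSONL dataset outside the runner, a few lines of Python are enough. The sketch below is illustrative only: `parse_jsonl` is a hypothetical helper, not part of the evals tooling, and skipping blank lines is an assumption of this sketch rather than documented runner behavior.

```python
import json


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSONL text into sample dicts, one per non-blank line."""
    samples = []
    for line in text.split("\n"):
        if line.strip():  # skipping blank lines is an assumption of this sketch
            samples.append(json.loads(line))
    return samples


jsonl_text = (
    '{"input": "What\'s the capital of France?", "ground_truth": "Paris"}\n'
    '{"input": "Calculate 2+2", "ground_truth": "4"}\n'
)
samples = parse_jsonl(jsonl_text)
print(len(samples), samples[0]["ground_truth"])  # prints: 2 Paris
```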

### CSV Format

Standard CSV with headers:

```
input,ground_truth
"What's the capital of France?","Paris"
"Calculate 2+2","4"
"What color is the sky?","blue"
```

**Best for:**

- Simple question-answer pairs
- Teams that prefer spreadsheets (Excel, Google Sheets)
- Non-technical collaborators creating test cases
- Quick dataset creation and editing
- Easy sharing with non-developers

## Quick Reference

| Field          | Required | Type             | Purpose                              |
| -------------- | -------- | ---------------- | ------------------------------------ |
| `input`        | ✅        | string or array  | Prompt(s) to send to agent           |
| `ground_truth` | ❌        | string           | Expected answer (for tool graders)   |
| `tags`         | ❌        | array of strings | For filtering samples                |
| `agent_args`   | ❌        | object           | Per-sample agent customization       |
| `rubric_vars`  | ❌        | object           | Per-sample rubric variables          |
| `metadata`     | ❌        | object           | Arbitrary extra data                 |
| `id`           | ❌        | integer          | Sample ID (auto-assigned if omitted) |

## Field Reference

### Required Fields

#### input

The prompt(s) to send to the agent. Can be a string or array of strings:

Single message:

```
{ "input": "Hello, who are you?" }
```

Multi-turn conversation:

```
{ "input": ["Hello", "What's your name?", "Tell me about yourself"] }
```
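If you post-process datasets yourself, it can help to treat both shapes uniformly. A minimal sketch, assuming a hypothetical helper `as_messages` (not part of the evals tooling) that turns either form into a list of turns:

```python
def as_messages(input_value):
    """Return the input as a list of messages, whether it was a string or an array."""
    if isinstance(input_value, str):
        return [input_value]  # single message becomes a one-turn conversation
    return list(input_value)  # multi-turn input is already a sequence of messages


single = as_messages("Hello, who are you?")
multi = as_messages(["Hello", "What's your name?", "Tell me about yourself"])
print(len(single), len(multi))  # prints: 1 3
```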

### Optional Fields

#### ground\_truth

The expected answer or content to check against. Required for most tool graders (exact\_match, contains, etc.):

```
{ "input": "What is 2+2?", "ground_truth": "4" }
```

#### metadata

Arbitrary additional data about the sample:

```
{
  "input": "What is photosynthesis?",
  "ground_truth": "process where plants convert light into energy",
  "metadata": {
    "category": "biology",
    "difficulty": "medium"
  }
}
```

#### tags

List of tags for filtering samples:

```
{ "input": "Solve x^2 = 16 for positive x", "ground_truth": "4", "tags": ["math", "algebra"] }
```

Filter by tags in your suite:

```
sample_tags: [math] # Only samples tagged "math" will be evaluated
```

#### agent\_args

Custom arguments passed to your agent-creation code when the target uses `agent_script`. Use them to customize the agent for an individual sample.

JSONL:

```
{
  "input": "What items do we have?",
  "agent_args": {
    "item": { "sku": "SKU-123", "name": "Widget A", "price": 19.99 }
  }
}
```

CSV:

```
input,agent_args
"What items do we have?","{""item"": {""sku"": ""SKU-123"", ""name"": ""Widget A"", ""price"": 19.99}}"
```

Your agent factory function can access these values via `sample.agent_args` to customize agent configuration.

See [Targets - agent\_script](/guides/evals/concepts/targets/index.md#agent_script) for details on programmatic agent creation.

#### rubric\_vars

Variables to inject into rubric templates when using rubric graders. This allows you to provide per-sample context or examples to the LLM judge.

**Example:** Evaluating code quality against a reference implementation.

JSONL:

```
{
  "input": "Write a function to calculate fibonacci numbers",
  "rubric_vars": {
    "reference_code": "def fib(n):\n    if n <= 1: return n\n    return fib(n-1) + fib(n-2)",
    "required_features": "recursion, base case"
  }
}
```

CSV:

```
input,rubric_vars
"Write a function to calculate fibonacci numbers","{""reference_code"": ""def fib(n):\n    if n <= 1: return n\n    return fib(n-1) + fib(n-2)"", ""required_features"": ""recursion, base case""}"
```

In your rubric template file, reference variables with `{variable_name}`:

**rubric.txt:**

```
Evaluate the submitted code against this reference implementation:


{reference_code}


Required features: {required_features}


Score on correctness (0.6) and code quality (0.4).
```

When the rubric grader runs, variables are replaced with values from `rubric_vars`:

**Final formatted prompt sent to LLM:**

```
Evaluate the submitted code against this reference implementation:


def fib(n):
    if n <= 1: return n
    return fib(n-1) + fib(n-2)


Required features: recursion, base case


Score on correctness (0.6) and code quality (0.4).
```

This lets you customize evaluation criteria per sample using the same rubric template.
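The substitution can be approximated with Python's built-in `str.format`. This is a sketch of the idea only; it assumes the grader's templating behaves like `str.format` for simple `{variable_name}` placeholders, which is not confirmed here.

```python
# The rubric template from rubric.txt, with {placeholders} for per-sample values.
template = (
    "Evaluate the submitted code against this reference implementation:\n\n"
    "{reference_code}\n\n"
    "Required features: {required_features}\n\n"
    "Score on correctness (0.6) and code quality (0.4)."
)

# The rubric_vars object from one dataset sample.
rubric_vars = {
    "reference_code": "def fib(n):\n    if n <= 1: return n\n    return fib(n-1) + fib(n-2)",
    "required_features": "recursion, base case",
}

# Assumption: placeholders are filled the way str.format fills them.
prompt = template.format(**rubric_vars)
print(prompt)
```

Under `str.format`, a placeholder with no matching key raises `KeyError`, and literal braces in a template must be written as `{{` and `}}`. Whether the real grader behaves the same way is worth checking before relying on it.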

See [Rubric Graders](/guides/evals/graders/rubric-graders/index.md) for details on rubric templates.

#### id

If you omit `id`, the sample is automatically assigned an ID equal to its 0-based index. To override it, set `id` explicitly:

```
{ "id": 42, "input": "Test case 42" }
```
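The defaulting rule can be sketched as follows. `assign_ids` is a hypothetical helper, and using the sample's position in the file as its index, even when other samples set `id` explicitly, is an assumption of this sketch.

```python
def assign_ids(samples):
    """Give each sample an id: its explicit value if set, else its 0-based position."""
    return [{**sample, "id": sample.get("id", index)}
            for index, sample in enumerate(samples)]


raw = [
    {"input": "First case"},
    {"id": 42, "input": "Test case 42"},
    {"input": "Third case"},
]
with_ids = assign_ids(raw)
print([s["id"] for s in with_ids])  # prints: [0, 42, 2]
```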

## Complete Example

```
{"id": 1, "input": "What is the capital of France?", "ground_truth": "Paris", "tags": ["geography", "easy"], "metadata": {"region": "Europe"}}
{"id": 2, "input": "Calculate the square root of 144", "ground_truth": "12", "tags": ["math", "medium"]}
{"id": 3, "input": ["Hello", "What can you help me with?"], "tags": ["conversation"]}
```

## Dataset Best Practices

### 1. Clear Ground Truth

Make ground truth specific enough to grade but flexible enough to match valid responses:

Good:

```
{"input": "What's the largest planet?", "ground_truth": "Jupiter"}
```

Too strict (might miss valid answers):

```
{"input": "What's the largest planet?", "ground_truth": "Jupiter is the largest planet in our solar system."}
```
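To see why the longer ground truth is brittle, compare how two simplified checks treat a reasonable response. The functions below are stand-ins written for this illustration; the real `exact_match` and `contains` graders may normalize text differently (for example, in case handling).

```python
def exact_match(response: str, ground_truth: str) -> bool:
    """Simplified stand-in: passes only if the whole response equals the ground truth."""
    return response.strip() == ground_truth.strip()


def contains(response: str, ground_truth: str) -> bool:
    """Simplified stand-in: passes if the ground truth appears anywhere in the response."""
    return ground_truth.lower() in response.lower()


response = "The largest planet is Jupiter."

print(contains(response, "Jupiter"))  # prints: True
print(contains(response, "Jupiter is the largest planet in our solar system."))  # prints: False
print(exact_match(response, "Jupiter"))  # prints: False
```

A short ground truth paired with a substring-style check accepts many valid phrasings, while a full-sentence ground truth only matches responses that repeat it verbatim.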

### 2. Diverse Test Cases

Include edge cases and variations:

```
{"input": "What is 2+2?", "ground_truth": "4", "tags": ["math", "easy"]}
{"input": "What is 0.1 + 0.2?", "ground_truth": "0.3", "tags": ["math", "floating_point"]}
{"input": "What is 999999999 + 1?", "ground_truth": "1000000000", "tags": ["math", "large_numbers"]}
```

### 3. Use Tags for Organization

Organize samples by type, difficulty, or feature:

```
{"tags": ["tool_usage", "search"]}
{"tags": ["memory", "recall"]}
{"tags": ["reasoning", "multi_step"]}
```

### 4. Multi-Turn Conversations

Test conversational context and memory updates:

```
{"input": ["My name is Alice", "What's my name?"], "ground_truth": "Alice", "tags": ["memory", "recall"]}
{"input": ["Please remember that I like bananas.", "Actually, sorry, I meant I like apples."], "ground_truth": "apples", "tags": ["memory", "correction"]}
{"input": ["I work at Google", "Update my workplace to Microsoft", "Where do I work?"], "ground_truth": "Microsoft", "tags": ["memory", "multi_step"]}
```

**Testing memory corrections:** Use multi-turn inputs to test if agents properly update memory when users correct themselves. Combine with the `memory_block` extractor to verify the final memory state, not just the response.

### 5. No Ground Truth for LLM Judges

If using rubric graders, ground truth is optional:

```
{"input": "Write a creative story about a robot", "tags": ["creative"]}
{"input": "Explain quantum computing simply", "tags": ["explanation"]}
```

The LLM judge evaluates based on the rubric, not ground truth.

## Loading Datasets

Datasets are automatically loaded by the runner:

```
dataset: path/to/dataset.jsonl # Path to your test cases (JSONL or CSV)
```

Paths are relative to the suite YAML file location.

## Dataset Filtering

### Limit Sample Count

```
max_samples: 10 # Only evaluate first 10 samples (useful for testing)
```

### Filter by Tags

```
sample_tags: [math, medium] # Only samples with ALL these tags
```
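Both filters are easy to reproduce when you want to preview which samples a suite will select. The helpers below are hypothetical, not the runner's implementation; they only mirror the behavior described above (a sample must carry every listed tag, and `max_samples` keeps the first N).

```python
def filter_by_tags(samples, required_tags):
    """Keep samples whose tags include ALL of the required tags."""
    required = set(required_tags)
    return [s for s in samples if required.issubset(s.get("tags", []))]


def limit_samples(samples, max_samples=None):
    """Keep only the first max_samples samples (all of them if max_samples is None)."""
    return samples if max_samples is None else samples[:max_samples]


dataset = [
    {"id": 1, "input": "What is the capital of France?", "tags": ["geography", "easy"]},
    {"id": 2, "input": "Calculate the square root of 144", "tags": ["math", "medium"]},
    {"id": 3, "input": ["Hello", "What can you help me with?"], "tags": ["conversation"]},
]

math_medium = filter_by_tags(dataset, ["math", "medium"])
print([s["id"] for s in math_medium])  # prints: [2]
print([s["id"] for s in limit_samples(dataset, 2)])  # prints: [1, 2]
```

How the runner orders the two filters when both are set is not covered here, so check the suite YAML reference before combining them.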

## Creating Datasets Programmatically

You can generate datasets with Python:

```
import json


samples = []
for i in range(100):
    samples.append({
        "input": f"What is {i} + {i}?",
        "ground_truth": str(i + i),
        "tags": ["math", "addition"]
    })


with open("dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

## Dataset Format Validation

The runner validates:

- Each line is valid JSON (for JSONL files)
- Required fields are present
- Field types are correct

Validation errors will be reported with line numbers.
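You can run similar checks yourself before starting an evaluation. The sketch below is a hypothetical pre-flight validator for JSONL text, not the runner's own validation; it only checks JSON syntax and the `input` field, and its error messages are invented for illustration.

```python
import json


def validate_jsonl(text: str) -> list[str]:
    """Return a list of error messages, each prefixed with a 1-based line number."""
    errors = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            sample = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(sample, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        if "input" not in sample:
            errors.append(f"line {lineno}: missing required field 'input'")
            continue
        value = sample["input"]
        is_string = isinstance(value, str)
        is_string_list = isinstance(value, list) and all(isinstance(m, str) for m in value)
        if not (is_string or is_string_list):
            errors.append(f"line {lineno}: 'input' must be a string or array of strings")
    return errors


text = (
    '{"input": "What is 2+2?", "ground_truth": "4"}\n'
    '{"input": "unterminated\n'
    '{"ground_truth": "Paris"}\n'
)
for error in validate_jsonl(text):
    print(error)
```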

## Examples by Use Case

### Question Answering

JSONL:

```
{"input": "What is the capital of France?", "ground_truth": "Paris"}
{"input": "Who wrote Romeo and Juliet?", "ground_truth": "Shakespeare"}
```

CSV:

```
input,ground_truth
"What is the capital of France?","Paris"
"Who wrote Romeo and Juliet?","Shakespeare"
```

### Tool Usage Testing

JSONL:

```
{"input": "Search for information about pandas", "ground_truth": "search"}
{"input": "Calculate 15 * 23", "ground_truth": "calculator"}
```

CSV:

```
input,ground_truth
"Search for information about pandas","search"
"Calculate 15 * 23","calculator"
```

In this pattern, `ground_truth` holds the name of the tool the agent is expected to use.

### Memory Testing (Multi-turn)

JSONL:

```
{"input": ["Remember that my favorite color is blue", "What's my favorite color?"], "ground_truth": "blue"}
{"input": ["I live in Tokyo", "Where do I live?"], "ground_truth": "Tokyo"}
```

CSV (using JSON array strings):

```
input,ground_truth
"[""Remember that my favorite color is blue"", ""What's my favorite color?""]","blue"
"[""I live in Tokyo"", ""Where do I live?""]","Tokyo"
```

### Code Generation

JSONL:

```
{"input": "Write a function to reverse a string in Python"}
{"input": "Create a SQL query to find users older than 21"}
```

CSV:

```
input
"Write a function to reverse a string in Python"
"Create a SQL query to find users older than 21"
```

Use rubric graders to evaluate code quality.

## CSV Advanced Features

CSV supports all the same features as JSONL by encoding complex data as JSON strings in cells:

**Multi-turn conversations** (requires escaped JSON array string):

```
input,ground_truth
"[""Hello"", ""What's your name?""]","Alice"
```

**Agent arguments** (requires escaped JSON object string):

```
input,agent_args
"What items do we have?","{""initial_inventory"": [""apple"", ""banana""]}"
```

**Rubric variables** (requires escaped JSON object string):

```
input,rubric_vars
"Write a story","{""max_length"": 500, ""genre"": ""sci-fi""}"
```

**Note:** Complex data structures require JSON encoding in CSV. If you’re frequently using these advanced features, JSONL may be easier to read and maintain.
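Writing the escaped quotes by hand is error-prone, so it may be easier to generate such files with Python's `csv` and `json` modules, which handle the quote doubling for you. The sketch below writes to an in-memory buffer and reads the result back. `encode_cell` and `decode_cell` are hypothetical helpers, and treating any cell that starts with `[` or `{` as JSON is an assumption of this sketch, not a description of how the runner parses cells.

```python
import csv
import io
import json

rows = [
    {"input": ["Hello", "What's your name?"], "ground_truth": "Alice"},
    {"input": "What is 2+2?", "ground_truth": "4"},
]


def encode_cell(value):
    """Leave plain strings alone; JSON-encode arrays and objects."""
    return value if isinstance(value, str) else json.dumps(value)


def decode_cell(cell):
    """Assumption: cells that look like JSON arrays/objects are decoded, others kept."""
    if cell[:1] in ("[", "{"):
        try:
            return json.loads(cell)
        except json.JSONDecodeError:
            return cell
    return cell


# Write: csv.QUOTE_ALL quotes every cell and doubles any embedded quotes.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["input", "ground_truth"], quoting=csv.QUOTE_ALL)
writer.writeheader()
for row in rows:
    writer.writerow({key: encode_cell(value) for key, value in row.items()})
csv_text = buffer.getvalue()
print(csv_text)

# Read back and decode the JSON-encoded cells.
decoded = [
    {key: decode_cell(cell) for key, cell in row.items()}
    for row in csv.DictReader(io.StringIO(csv_text))
]
print(decoded == rows)  # prints: True
```

When writing to a real file instead of a buffer, open it with `newline=""` as the `csv` module documentation recommends.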

## Next Steps

- [Suite YAML Reference](/guides/evals/configuration/suite-yaml/index.md) - Complete configuration options including filtering
- [Graders](/guides/evals/concepts/graders/index.md) - How to evaluate agent responses
- [Multi-Turn Conversations](/guides/evals/advanced/multi-turn-conversations/index.md) - Testing conversational flows
