---
title: Suites | Letta Docs
---

A **suite** is a YAML configuration file that defines a complete evaluation specification. It’s the central piece that ties together your dataset, target agent, grading criteria, and pass/fail thresholds.

**Quick overview:**

- **Single file defines everything**: Dataset, agent, graders, and success criteria all in one YAML
- **Reusable and shareable**: Version control your evaluation specs alongside your code
- **Multi-metric support**: Evaluate multiple aspects (accuracy, quality, tool usage) in one run
- **Multi-model testing**: Run the same suite across different LLM models
- **Flexible filtering**: Test subsets using tags or sample limits

**Typical workflow:**

1. Create a suite YAML defining what to test and how
2. Run `letta-evals run suite.yaml`
3. Review results showing scores for each metric
4. Suite passes or fails based on gate criteria

## Basic Structure

```
name: my-evaluation # Suite identifier
description: Optional description of what this tests # Human-readable explanation
dataset: path/to/dataset.jsonl # Test cases


target: # What agent to evaluate
  kind: agent
  agent_file: agent.af # Agent to test
  base_url: https://api.letta.com # Letta server


graders: # How to evaluate responses
  my_metric:
    kind: tool # Deterministic grading
    function: exact_match # Grading function
    extractor: last_assistant # What to extract from agent response


gate: # Pass/fail criteria
  metric_key: my_metric # Which metric to check
  op: gte # Greater than or equal
  value: 0.8 # 80% threshold
```

## Required Fields

### name

The name of your evaluation suite. Used in output and results.

```
name: question-answering-eval
```

### dataset

Path to the JSONL or CSV dataset file. Can be relative (to the suite YAML) or absolute.

```
dataset: ./datasets/qa.jsonl # Relative to suite YAML location
```

### target

Specifies what agent to evaluate. See [Targets](/guides/evals/concepts/targets/index.md) for details.

### graders

One or more graders to evaluate agent performance. See [Graders](/guides/evals/concepts/graders/index.md) for details.

### gate

Pass/fail criteria. See [Gates](/guides/evals/concepts/gates/index.md) for details.

## Optional Fields

### description

A human-readable description of what this suite tests:

```
description: Tests the agent's ability to answer factual questions accurately
```

### max\_samples

Limit the number of samples to evaluate (useful for quick tests):

```
max_samples: 10 # Only evaluate first 10 samples
```

### sample\_tags

Filter samples by tags (only evaluate samples with these tags):

```
sample_tags: [math, easy] # Only samples tagged with "math" AND "easy"
```

Dataset samples can include tags:

```
{
  "input": "What is 2+2?",
  "ground_truth": "4",
  "tags": [
    "math",
    "easy"
  ]
}
```
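The two filters above can be sketched in plain Python. This illustrates the documented semantics (a sample is kept only if it carries every requested tag) and is not the letta-evals implementation; `filter_samples` is a hypothetical helper, and applying `max_samples` after tag filtering is an assumption.

```python
import json


def filter_samples(lines, sample_tags=None, max_samples=None):
    """Keep samples carrying every tag in sample_tags, then truncate.

    Hypothetical helper for illustration only; the order of the two
    filters (tags first, then max_samples) is an assumption.
    """
    samples = [json.loads(line) for line in lines if line.strip()]
    if sample_tags:
        wanted = set(sample_tags)
        # A sample without a "tags" key is treated as having no tags.
        samples = [s for s in samples if wanted <= set(s.get("tags", []))]
    if max_samples is not None:
        samples = samples[:max_samples]
    return samples


dataset = [
    '{"input": "What is 2+2?", "ground_truth": "4", "tags": ["math", "easy"]}',
    '{"input": "Integrate x^2", "ground_truth": "x^3/3 + C", "tags": ["math", "hard"]}',
    '{"input": "Capital of France?", "ground_truth": "Paris", "tags": ["geography", "easy"]}',
]

selected = filter_samples(dataset, sample_tags=["math", "easy"])
print([s["input"] for s in selected])
```

Only the first sample carries both `math` and `easy`, so it is the only one selected.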

### num\_runs

Number of times to run the entire evaluation suite (useful for testing non-deterministic behavior):

```
num_runs: 5 # Run the evaluation 5 times
```

Default: 1

### setup\_script

Path to a Python script with a setup function to run before evaluation:

```
setup_script: setup.py:prepare_environment # script.py:function_name
```

The setup function should have this signature:

```
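from letta_evals.models import SuiteSpec  # import path assumed; verify against your installed version


def prepare_environment(suite: SuiteSpec) -> None:
    # Setup code here
    pass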
```
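The `script.py:function_name` value can be read as a file path plus the name of a callable in that file. The sketch below shows one way such a value could be split and loaded with the standard library; `parse_setup_spec` and `load_setup_function` are hypothetical names, and this is not how letta-evals necessarily loads the script.

```python
import importlib.util
from pathlib import Path


def parse_setup_spec(spec):
    """Split 'script.py:function_name' into (script_path, function_name)."""
    # rpartition splits on the last colon, so drive letters in paths survive.
    script, sep, func = spec.rpartition(":")
    if not sep or not script or not func:
        raise ValueError(f"expected 'script.py:function_name', got {spec!r}")
    return script, func


def load_setup_function(spec, base_dir):
    """Import the script relative to base_dir and return the named callable."""
    script, func = parse_setup_spec(spec)
    path = Path(base_dir) / script
    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    return getattr(module, func)


print(parse_setup_spec("setup.py:prepare_environment"))
```

Resolving the script against `base_dir` assumes the same relative-path rule that applies to other paths in the suite YAML.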

## Path Resolution

Paths in the suite YAML are resolved relative to the YAML file location:

```
project/
├── suite.yaml
├── dataset.jsonl
└── agents/
    └── my_agent.af
```

```
# In suite.yaml
dataset: dataset.jsonl # Resolves to project/dataset.jsonl
target:
  agent_file: agents/my_agent.af # Resolves to project/agents/my_agent.af
```

Absolute paths are used as-is.
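The resolution rule can be expressed in a few lines with `pathlib`. This is a minimal sketch of the rule as described here, using a hypothetical `resolve_suite_path` helper rather than the library's own code.

```python
from pathlib import Path


def resolve_suite_path(suite_file, value):
    """Resolve a path from the suite YAML against the YAML file's directory."""
    candidate = Path(value)
    if candidate.is_absolute():
        return candidate  # absolute paths are used as-is
    return Path(suite_file).parent / candidate


suite = "project/suite.yaml"
print(resolve_suite_path(suite, "dataset.jsonl").as_posix())
print(resolve_suite_path(suite, "agents/my_agent.af").as_posix())
```

The two printed paths match the comments in the YAML example: `project/dataset.jsonl` and `project/agents/my_agent.af`.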

## Multi-Grader Suites

You can evaluate multiple metrics in one suite:

```
graders:
  accuracy: # Check if answer is correct
    kind: tool
    function: exact_match
    extractor: last_assistant


  completeness: # LLM judges response quality
    kind: rubric
    prompt_path: rubrics/completeness.txt
    model: gpt-4o-mini
    extractor: last_assistant


  tool_usage: # Verify correct tool was called
    kind: tool
    function: contains
    extractor: tool_arguments # Extract tool call arguments
```

The gate can check any of these metrics:

```
gate:
  metric_key: accuracy # Gate on accuracy metric (others still computed)
  op: gte # Greater than or equal
  value: 0.9 # 90% threshold
```

Results will include scores for all graders, even if you only gate on one.
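Conceptually, the gate compares one metric against a threshold and ignores the rest. A minimal sketch of that comparison follows; `gate_passes` is a hypothetical helper, the scores are made-up illustration values, and every operator name other than `gte` is an assumption not confirmed by this page.

```python
import operator

# Only "gte" appears in this document; the other operator names are assumed
# for illustration and may not match what letta-evals accepts.
OPS = {
    "gte": operator.ge,
    "gt": operator.gt,
    "lte": operator.le,
    "lt": operator.lt,
    "eq": operator.eq,
}


def gate_passes(metrics, gate):
    """Return True if the gated metric satisfies the threshold."""
    score = metrics[gate["metric_key"]]
    return OPS[gate["op"]](score, gate["value"])


# Hypothetical aggregate scores for the three graders above.
metrics = {"accuracy": 0.92, "completeness": 0.61, "tool_usage": 0.85}
gate = {"metric_key": "accuracy", "op": "gte", "value": 0.9}
print(gate_passes(metrics, gate))
```

In this sketch the low `completeness` score is still reported but does not affect the outcome, because the gate only reads `accuracy`.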

## Examples

### Simple Tool Grader Suite

```
name: basic-qa # Suite name
dataset: questions.jsonl # Test questions


target:
  kind: agent
  agent_file: qa_agent.af # Pre-configured agent
  base_url: https://api.letta.com # Letta server


graders:
  accuracy: # Single metric
    kind: tool # Deterministic grading
    function: contains # Check if ground truth is in response
    extractor: last_assistant # Use final agent message


gate:
  metric_key: accuracy # Gate on this metric
  op: gte # Must be >=
  value: 0.75 # 75% to pass
```

### Rubric Grader Suite

```
name: quality-eval # Quality evaluation
dataset: prompts.jsonl # Test prompts


target:
  kind: agent
  agent_id: existing-agent-123 # Use existing agent
  base_url: https://api.letta.com # Letta API


graders:
  quality: # LLM-as-judge metric
    kind: rubric # Subjective evaluation
    prompt_path: quality_rubric.txt # Rubric template
    model: gpt-4o-mini # Judge model
    temperature: 0.0 # Deterministic
    extractor: last_assistant # Evaluate final response


gate:
  metric_key: quality # Gate on this metric
  metric: avg_score # Use average score
  op: gte # Must be >=
  value: 0.7 # 70% to pass
```

### Multi-Model Suite

Test the same agent configuration across different models:

```
name: model-comparison # Compare model performance
dataset: test.jsonl # Same test for all models


target:
  kind: agent
  agent_file: agent.af # Same agent configuration
  base_url: https://api.letta.com # Letta server
  model_configs: [gpt-4o-mini, claude-3-5-sonnet] # Test both models


graders:
  accuracy: # Single metric for comparison
    kind: tool
    function: exact_match
    extractor: last_assistant


gate:
  metric_key: accuracy # Both models must pass this
  op: gte # Must be >=
  value: 0.8 # 80% threshold
```

Results will show per-model metrics.

## Validation

Validate your suite configuration before running:

```
letta-evals validate suite.yaml
```

This checks:

- Required fields are present
- Paths exist
- Configuration is valid
- Grader/extractor combinations are compatible
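As a rough illustration of the first check, the sketch below reports which required top-level fields are missing from an already-parsed suite. It is not a substitute for `letta-evals validate`; `missing_required_fields` is a hypothetical helper, and the field list is taken from the Required Fields section of this page.

```python
REQUIRED_FIELDS = ("name", "dataset", "target", "graders", "gate")


def missing_required_fields(suite):
    """Return the required top-level keys absent from a parsed suite mapping."""
    return [field for field in REQUIRED_FIELDS if field not in suite]


# A parsed suite (for example, the result of loading the YAML) with no gate.
suite = {
    "name": "basic-qa",
    "dataset": "questions.jsonl",
    "target": {"kind": "agent", "agent_file": "qa_agent.af"},
    "graders": {"accuracy": {"kind": "tool", "function": "contains"}},
}
print(missing_required_fields(suite))
```

Here the only missing field is `gate`, so the printed list contains that single name.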

## Next Steps

- [Dataset Configuration](/guides/evals/concepts/datasets/index.md)
- [Target Configuration](/guides/evals/concepts/targets/index.md)
- [Grader Configuration](/guides/evals/concepts/graders/index.md)
- [Gate Configuration](/guides/evals/concepts/gates/index.md)
