Suites
A suite is a YAML configuration file that defines a complete evaluation specification. It’s the central piece that ties together your dataset, target agent, grading criteria, and pass/fail thresholds.
Quick overview:
- Single file defines everything: Dataset, agent, graders, and success criteria all in one YAML
- Reusable and shareable: Version control your evaluation specs alongside your code
- Multi-metric support: Evaluate multiple aspects (accuracy, quality, tool usage) in one run
- Multi-model testing: Run the same suite across different LLMs
- Flexible filtering: Test subsets using tags or sample limits
Typical workflow:
- Create a suite YAML defining what and how to test
- Run `letta-evals run suite.yaml`
- Review results showing scores for each metric
- Suite passes or fails based on gate criteria
Basic Structure
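As a rough sketch, a suite file is a single YAML mapping whose top-level keys are the fields described in the sections that follow. The name and path are hypothetical, and the empty mappings are placeholders for content covered in Targets, Graders, and Gates:

```yaml
# suite.yaml: top-level layout only (values are illustrative)
name: my-eval-suite            # hypothetical suite name
dataset: data/samples.jsonl    # hypothetical path to a JSONL or CSV file
target: {}                     # placeholder: the agent to evaluate (see Targets)
graders: {}                    # placeholder: one or more graders (see Graders)
gate: {}                       # placeholder: pass/fail criteria (see Gates)
```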
Required Fields
name
The name of your evaluation suite. Used in output and results.
dataset
Path to the JSONL or CSV dataset file. Can be relative (to the suite YAML) or absolute.
target
Specifies which agent to evaluate. See Targets for details.
graders
One or more graders to evaluate agent performance. See Graders for details.
gate
Pass/fail criteria. See Gates for details.
Optional Fields
description
A human-readable description of what this suite tests:
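For illustration, with made-up wording:

```yaml
description: "Checks that the agent answers basic arithmetic questions correctly"
```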
max_samples
Limit the number of samples to evaluate (useful for quick tests):
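A minimal sketch; the value 10 is arbitrary:

```yaml
max_samples: 10   # evaluate at most 10 samples from the dataset
```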
sample_tags
Filter samples by tags (only evaluate samples with these tags):
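A minimal sketch, assuming the field takes a list of tag strings; the tag names are hypothetical:

```yaml
sample_tags:
  - math    # hypothetical tag
  - easy    # hypothetical tag
```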
Dataset samples can include tags:
num_runs
Number of times to run the entire evaluation suite (useful for testing non-deterministic behavior):
Default: 1
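For example, to repeat the suite several times (the value 5 is arbitrary):

```yaml
num_runs: 5   # run the entire suite five times instead of the default once
```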
setup_script
Path to a Python script with a setup function to run before evaluation:
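A sketch of how this might look. The `file.py:function_name` form, the file name, and the function name are all assumptions rather than confirmed syntax:

```yaml
# Assumed form: <script path>:<function name>; both names are hypothetical
setup_script: setup.py:prepare_environment
```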
The setup function should have this signature:
Path Resolution
Paths in the suite YAML are resolved relative to the YAML file location:
Absolute paths are used as-is.
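To make the rule concrete, suppose the suite file lives at the hypothetical location `/project/evals/suite.yaml`. The comments show how each path would resolve under the rule above:

```yaml
# File: /project/evals/suite.yaml (hypothetical location)
dataset: data/samples.jsonl
# relative, so it resolves to /project/evals/data/samples.jsonl

# An absolute path would be used unchanged, for example:
# dataset: /shared/datasets/samples.jsonl
```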
Multi-Grader Suites
You can evaluate multiple metrics in one suite:
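A sketch of a `graders` block with two metrics. It assumes `graders` is a mapping from metric name to grader configuration. The metric names, the nested keys (`kind`, `function`, `extractor`, `prompt_path`), and their values are assumptions, so check Graders for the actual options:

```yaml
graders:
  accuracy:                    # hypothetical metric name
    kind: tool                 # assumed key and value
    function: exact_match      # assumed grader function name
    extractor: last_assistant  # assumed extractor name
  quality:                     # hypothetical metric name
    kind: rubric               # assumed key and value
    prompt_path: rubric.txt    # hypothetical rubric file, relative to the suite YAML
    extractor: last_assistant  # assumed extractor name
```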
The gate can check any of these metrics:
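Assuming a suite that defines graders named `accuracy` and `quality` (hypothetical names), a gate on one of them might look like this. The keys `metric_key`, `op`, and `value` are assumptions; see Gates for the actual schema:

```yaml
gate:
  metric_key: accuracy   # assumed key: which grader's score to check
  op: gte                # assumed key: comparison operator (greater than or equal)
  value: 0.8             # assumed key: threshold the score is compared against
```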
Results will include scores for all graders, even if you only gate on one.
Examples
Simple Tool Grader Suite
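A sketch of what a small suite with a single tool grader could look like. Only the top-level field names come from this page; every nested key, file name, and value is an assumption for illustration:

```yaml
name: simple-tool-suite              # hypothetical suite name
description: "Exact-match check on the agent's final answer"
dataset: data/questions.jsonl        # hypothetical dataset path
target:
  kind: agent                        # assumed key and value
  agent_file: agent.af               # hypothetical agent file
graders:
  accuracy:                          # hypothetical metric name
    kind: tool                       # assumed key and value
    function: exact_match            # assumed grader function name
    extractor: last_assistant        # assumed extractor name
gate:
  metric_key: accuracy               # assumed key
  op: gte                            # assumed key
  value: 0.8                         # assumed key; threshold is arbitrary
```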
Rubric Grader Suite
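A sketch of a suite graded against a rubric instead of a tool function. As with any illustration here, the nested keys (including `prompt_path` and `model`), file names, and values are assumptions, not confirmed schema:

```yaml
name: rubric-suite                   # hypothetical suite name
dataset: data/prompts.jsonl          # hypothetical dataset path
target:
  kind: agent                        # assumed key and value
  agent_file: agent.af               # hypothetical agent file
graders:
  quality:                           # hypothetical metric name
    kind: rubric                     # assumed key and value
    prompt_path: rubric.txt          # hypothetical file holding the rubric text
    model: gpt-4o-mini               # assumed key; hypothetical judge model
    extractor: last_assistant        # assumed extractor name
gate:
  metric_key: quality                # assumed key
  op: gte                            # assumed key
  value: 0.7                         # assumed key; threshold is arbitrary
```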
Multi-Model Suite
Test the same agent configuration across different models:
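One way this might be expressed is a list of models on the target. The `model_configs` key, the other nested keys, and the model names are assumptions; see Targets for the supported options:

```yaml
target:
  kind: agent               # assumed key and value
  agent_file: agent.af      # hypothetical agent file
  model_configs:            # assumed key: the suite is run once per listed model
    - gpt-4o-mini           # hypothetical model name
    - claude-3-5-sonnet     # hypothetical model name
```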
Results will show per-model metrics.
Validation
Validate your suite configuration before running:
This checks:
- Required fields are present
- Paths exist
- Configuration is valid
- Grader/extractor combinations are compatible