Suite YAML Reference
Complete reference for suite configuration files.
A suite is a YAML file that defines an evaluation: what agent to test, what dataset to use, how to grade responses, and what criteria determine pass/fail. This is your evaluation specification.
Quick overview:
- name: Identifier for your evaluation
- dataset: JSONL file with test cases
- target: Which agent to evaluate (via file, ID, or script)
- graders: How to score responses (tool or rubric graders)
- gate: Pass/fail criteria
See Getting Started for a tutorial, or Core Concepts for a conceptual overview.
File Structure
Top-Level Fields
name (required)
Suite name, used in output and results.
Type: string
description (optional)
Human-readable description of what the suite tests.
Type: string
dataset (required)
Path to JSONL dataset file. Relative paths are resolved from the suite YAML location.
Type: path (string)
max_samples (optional)
Limit the number of samples to evaluate. Useful for quick tests.
Type: integer | Default: All samples
sample_tags (optional)
Filter samples by tags. Only samples with ALL specified tags are evaluated.
Type: array of strings
num_runs (optional)
Number of times to run the evaluation suite.
Type: integer | Default: 1
setup_script (optional)
Path to a Python script containing a setup function.
Type: string (format: path/to/script.py:function_name)
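For instance, the top-level fields might be combined like this. This is a sketch showing only the scalar fields (target, graders, and gate are covered below); every name, path, and tag is a placeholder:

```yaml
name: support-agent-eval
description: Checks that the support agent resolves billing questions
dataset: data/billing_cases.jsonl   # resolved relative to this YAML file
max_samples: 25                     # evaluate at most 25 samples
sample_tags: [billing, refunds]     # keep only samples carrying ALL these tags
num_runs: 3                         # run the whole suite three times
setup_script: scripts/setup.py:prepare
```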
target (required)
Configuration for the agent being evaluated.
kind (required)
Type of target. Currently only "agent" is supported.
base_url (optional)
Letta server URL. Default: http://localhost:8283
api_key (optional)
API key for Letta authentication. Can also be set via the LETTA_API_KEY environment variable.
timeout (optional)
Request timeout in seconds. Default: 300.0
Agent Source (required, pick one)
Exactly one of these must be specified:
agent_id
ID of an existing agent on the server.
agent_file
Path to a .af agent file.
agent_script
Path to a Python script with an agent factory.
See Targets for details on agent sources.
model_configs (optional)
List of model configuration names to test. Cannot be used with model_handles.
model_handles (optional)
List of model handles for cloud deployments. Cannot be used with model_configs.
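Putting these together, a target block might look like the following sketch; the agent file path and model configuration names are placeholders, and the defaults are spelled out for clarity:

```yaml
target:
  kind: agent                       # currently the only supported kind
  base_url: http://localhost:8283   # default
  timeout: 300.0                    # default request timeout in seconds
  # api_key can be set here or via the LETTA_API_KEY environment variable
  agent_file: agents/support.af     # exactly one agent source
  model_configs:                    # mutually exclusive with model_handles
    - model-config-a
    - model-config-b
```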
graders (required)
One or more graders, each with a unique key.
kind (required)
Grader type: "tool" or "rubric".
extractor (required)
Name of the extractor to use.
Tool Grader Fields
function (required for tool graders)
Name of the grading function.
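For example, a tool grader entry might look like this sketch. It assumes graders are keyed by name as described above; the grader key, function name, and extractor name are placeholders:

```yaml
graders:
  correctness:
    kind: tool
    extractor: final_answer   # placeholder extractor name
    function: exact_match     # placeholder grading function
```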
Rubric Grader Fields
prompt or prompt_path (required)
Inline rubric prompt or path to rubric file.
model (optional)
LLM model for judging. Default: gpt-4o-mini
temperature (optional)
Temperature for LLM generation. Default: 0.0
agent_file (agent-as-judge)
Path to a .af agent file to use as the judge.
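A rubric grader might be configured like this sketch; the grader key, rubric path, and extractor name are placeholders, and the defaults are shown for clarity:

```yaml
graders:
  helpfulness:
    kind: rubric
    extractor: final_answer              # placeholder extractor name
    prompt_path: rubrics/helpfulness.txt
    model: gpt-4o-mini                   # default judge model
    temperature: 0.0                     # default
    # Agent-as-judge alternative to a plain LLM judge:
    # agent_file: agents/judge.af
```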
gate (required)
Pass/fail criteria for the evaluation.
metric_key (optional)
Which grader to evaluate. If only one grader, this can be omitted.
metric (optional)
Which aggregate to compare: avg_score or accuracy. Default: avg_score
op (required)
Comparison operator: gte, gt, lte, lt, or eq
value (required)
Threshold value for comparison (0.0 to 1.0).
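For example, to require an average score of at least 0.8 from a grader keyed correctness (a placeholder key):

```yaml
gate:
  metric_key: correctness   # optional when there is only one grader
  metric: avg_score         # or accuracy
  op: gte
  value: 0.8
```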
Complete Examples
Minimal Suite
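A minimal sketch assembled from the fields above; the names, paths, and grader key are placeholders:

```yaml
name: minimal-example
dataset: data/cases.jsonl
target:
  kind: agent
  agent_file: agents/demo.af
graders:
  quality:
    kind: rubric
    extractor: final_answer   # placeholder extractor name
    prompt: "Score the response from 0.0 to 1.0 for factual accuracy."
gate:
  op: gte      # metric_key omitted: there is only one grader
  value: 0.7
```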
Multi-Metric Suite
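A sketch combining a tool grader and a rubric grader, with the gate applied to one of them; all names, paths, and the agent ID are placeholders:

```yaml
name: multi-metric-example
description: Two graders, gated on the tool grader's accuracy
dataset: data/cases.jsonl
target:
  kind: agent
  agent_id: agent-1234            # placeholder ID of an existing agent
graders:
  tool_check:
    kind: tool
    extractor: final_answer       # placeholder extractor name
    function: exact_match         # placeholder grading function
  judge:
    kind: rubric
    extractor: final_answer
    prompt_path: rubrics/helpfulness.txt
gate:
  metric_key: tool_check          # gate on one grader when several exist
  metric: accuracy
  op: gte
  value: 0.9
```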
Validation
Validate your suite before running it to catch problems early, such as a missing required field or mutually exclusive options (setting both model_configs and model_handles, for example).
Next Steps
- Targets - Understanding agent sources and configuration
- Graders - Tool graders vs rubric graders
- Extractors - What to extract from agent responses
- Gates - Setting pass/fail criteria