
Suite YAML Reference

Complete reference for suite configuration files.

A suite is a YAML file that defines an evaluation: what agent to test, what dataset to use, how to grade responses, and what criteria determine pass/fail. This is your evaluation specification.

Quick overview:

  • name: Identifier for your evaluation
  • dataset: JSONL file with test cases
  • target: Which agent to evaluate (via file, ID, or script)
  • graders: How to score responses (tool or rubric graders)
  • gate: Pass/fail criteria

See Getting Started for a tutorial, or Core Concepts for conceptual overview.
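
Putting these pieces together, a minimal suite looks like the following (the same example appears, with more elaborate variants, at the end of this page):

name: basic-eval
dataset: dataset.jsonl
target:
  kind: agent
  agent_file: agent.af
graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant
gate:
  op: gte
  value: 0.8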

Full schema:

name: string (required)
description: string (optional)
dataset: path (required)
max_samples: integer (optional)
sample_tags: array (optional)
num_runs: integer (optional)
setup_script: string (optional)
target: object (required)
  kind: "agent"
  base_url: string
  api_key: string
  timeout: float
  project_id: string
  agent_id: string (one of: agent_id, agent_file, agent_script)
  agent_file: path
  agent_script: string
  model_configs: array
  model_handles: array
graders: object (required)
  <metric_key>: object
    kind: "tool" | "rubric"
    display_name: string
    extractor: string
    extractor_config: object
    # Tool grader fields
    function: string
    # Rubric grader fields (LLM API)
    prompt: string
    prompt_path: path
    model: string
    temperature: float
    provider: string
    max_retries: integer
    timeout: float
    rubric_vars: array
    # Rubric grader fields (agent-as-judge)
    agent_file: path
    judge_tool_name: string
gate: object (required)
  metric_key: string
  metric: "avg_score" | "accuracy"
  op: "gte" | "gt" | "lte" | "lt" | "eq"
  value: float
  pass_op: "gte" | "gt" | "lte" | "lt" | "eq"
  pass_value: float

name (required)

Suite name, used in output and results.

Type: string

Example:

name: question-answering-eval

description (optional)

Human-readable description of what the suite tests.

Type: string

Example:

description: Tests agent's ability to answer factual questions accurately

dataset (required)

Path to the JSONL dataset file. Relative paths are resolved from the suite YAML location.

Type: path (string)

Example:

dataset: ./datasets/qa.jsonl
dataset: /absolute/path/to/dataset.jsonl
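
Each line of the dataset file is one JSON object. A minimal sample, using fields referenced elsewhere on this page, might look like:

{"input": "What is the capital of France?", "ground_truth": "Paris", "tags": ["geography"]}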

max_samples (optional)

Limit the number of samples to evaluate. Useful for quick tests.

Type: integer

Default: All samples

Example:

max_samples: 10 # Only evaluate first 10 samples

sample_tags (optional)

Filter samples by tags. Only samples with ALL specified tags are evaluated.

Type: array of strings

Example:

sample_tags: [math, easy] # Only samples tagged with both

Dataset samples must carry the corresponding tags:

{
  "input": "What is 2+2?",
  "ground_truth": "4",
  "tags": ["math", "easy"]
}

num_runs (optional)

Number of times to run the evaluation suite. Useful for testing non-deterministic behavior or collecting multiple runs for statistical analysis.

Type: integer

Default: 1

Example:

num_runs: 5 # Run the evaluation 5 times

setup_script (optional)

Path to a Python script with a setup function.

Type: string (format: path/to/script.py:function_name)

Example:

setup_script: setup.py:prepare_environment

The expected function signature:

def prepare_environment(suite: SuiteSpec) -> None:
    # Setup code runs once before the evaluation starts
    pass
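
For example, a setup function might fail fast when a required credential is missing (an illustrative sketch; the SuiteSpec annotation is omitted for brevity):

import os

def prepare_environment(suite) -> None:
    # Illustrative only: abort early if the key the target needs is absent.
    if not os.environ.get("LETTA_API_KEY"):
        raise RuntimeError("LETTA_API_KEY must be set before running this suite")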

target (required)

Configuration for the agent being evaluated.

kind

Type of target. Currently only "agent" is supported.

Type: string

Example:

target:
  kind: agent

base_url (optional)

Letta server URL.

Type: string

Default: http://localhost:8283

Example:

target:
  base_url: http://localhost:8283
  # or
  base_url: https://api.letta.com

api_key (optional)

API key for Letta authentication. Can also be set via the LETTA_API_KEY environment variable.

Type: string

Example:

target:
  api_key: your-api-key-here
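
To avoid hard-coding the key, the advanced example at the end of this page substitutes it from the environment:

target:
  api_key: ${LETTA_API_KEY}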

timeout

Request timeout in seconds.

Type: float

Default: 300.0

Example:

target:
  timeout: 600.0  # 10 minutes

project_id (optional)

Letta project ID (for Letta Cloud).

Type: string

Example:

target:
  project_id: proj_abc123

Agent source (exactly one required)

Exactly one of the following must be specified:

agent_id

ID of an existing agent on the server.

Type: string

Example:

target:
  agent_id: agent-123-abc

agent_file

Path to a .af agent file.

Type: path (string, must end in .af)

Example:

target:
  agent_file: ./agents/my_agent.af

agent_script

Path to a Python script with an agent factory.

Type: string (format: path/to/script.py:ClassName)

Example:

target:
  agent_script: factory.py:MyAgentFactory

See Targets for details on agent sources.

model_configs (optional)

List of model configuration names to test. Cannot be used with model_handles.

Type: array of strings

Example:

target:
  model_configs: [gpt-4o-mini, claude-3-5-sonnet]

model_handles (optional)

List of model handles for cloud deployments. Cannot be used with model_configs.

Type: array of strings

Example:

target:
  model_handles: ["openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"]

graders (required)

One or more graders, each with a unique key.

The key becomes the metric name:

graders:
  accuracy:   # This is the metric_key
    kind: tool
    ...
  quality:    # Another metric_key
    kind: rubric
    ...

kind

Grader type: "tool" or "rubric".

Type: string

Example:

graders:
  my_metric:
    kind: tool

display_name (optional)

Human-friendly name for CLI/UI output.

Type: string

Example:

graders:
  acc:
    display_name: "Answer Accuracy"
    kind: tool
    ...

extractor

Name of the extractor to use.

Type: string

Example:

graders:
  my_metric:
    extractor: last_assistant

extractor_config

Configuration passed to the extractor.

Type: object

Example:

graders:
  my_metric:
    extractor: pattern
    extractor_config:
      pattern: "Answer: (.*)"
      group: 1
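
With this configuration, a response such as "The result is below. Answer: 42" should be reduced to the first capture group, "42", before grading.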

function (tool graders)

Name of the grading function.

Type: string

Example:

graders:
  accuracy:
    kind: tool
    function: exact_match

prompt (rubric graders)

Inline rubric prompt.

Type: string

Example:

graders:
  quality:
    kind: rubric
    prompt: |
      Evaluate response quality from 0.0 to 1.0.
      Input: {input}
      Response: {submission}

prompt_path (rubric graders)

Path to a rubric file. Cannot use both prompt and prompt_path.

Type: path (string)

Example:

graders:
  quality:
    kind: rubric
    prompt_path: rubrics/quality.txt

model (rubric graders)

LLM model for judging.

Type: string

Default: gpt-4o-mini

Example:

graders:
  quality:
    kind: rubric
    model: gpt-4o

temperature (rubric graders)

Temperature for LLM generation.

Type: float (0.0 to 2.0)

Default: 0.0

Example:

graders:
  quality:
    kind: rubric
    temperature: 0.0

provider (rubric graders)

LLM provider.

Type: string

Default: openai

Example:

graders:
  quality:
    kind: rubric
    provider: openai

max_retries (rubric graders)

Maximum retry attempts for API calls.

Type: integer

Default: 5

Example:

graders:
  quality:
    kind: rubric
    max_retries: 3

timeout (rubric graders)

Timeout for API calls in seconds.

Type: float

Default: 120.0

Example:

graders:
  quality:
    kind: rubric
    timeout: 60.0

rubric_vars (rubric graders)

List of custom variable names that must be provided in the dataset for rubric template substitution. When specified, the grader validates that each sample includes these variables in its rubric_vars field.

Type: array of strings

Example:

graders:
  code_quality:
    kind: rubric
    rubric_vars: [reference_code, required_features]  # Require these variables in the dataset
    prompt: |
      Compare the submission to this reference:
      {reference_code}
      Required features: {required_features}

Each dataset sample must then provide these variables:

{
  "input": "Write a fibonacci function",
  "rubric_vars": {
    "reference_code": "def fib(n):\n if n <= 1: return n\n return fib(n-1) + fib(n-2)",
    "required_features": "recursion, base case"
  }
}

See Datasets - rubric_vars for details.

agent_file (for agent-as-judge)

Path to a .af agent file to use as the judge for rubric grading. Use this instead of model when you want a Letta agent to act as the evaluator.

Type: path (string)

Mutually exclusive with: model, temperature, provider, max_retries, timeout

Example:

graders:
  agent_judge:
    kind: rubric
    agent_file: judge.af       # Judge agent with submit_grade tool
    prompt_path: rubric.txt    # Evaluation criteria
    extractor: last_assistant

Requirements: The judge agent must have a tool with signature submit_grade(score: float, rationale: str). The framework validates this on initialization.
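
In Python, a judge tool with the required shape might look like this (a sketch; only the signature matters to the validator, and the body is illustrative):

def submit_grade(score: float, rationale: str) -> str:
    """Submit the final grade for the evaluated response."""
    # score: value between 0.0 and 1.0; rationale: why that score was given.
    return f"Recorded grade {score}: {rationale}"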

See Rubric Graders - Agent-as-Judge for complete documentation.

judge_tool_name (optional, for agent-as-judge)

Name of the tool that the judge agent uses to submit scores. Only applicable when using agent_file.

Type: string

Default: submit_grade

Example:

graders:
  agent_judge:
    kind: rubric
    agent_file: judge.af
    judge_tool_name: submit_grade  # Default, can be omitted
    prompt_path: rubric.txt
    extractor: last_assistant

Tool requirements: The tool must have exactly two parameters:

  • score: float - Score between 0.0 and 1.0
  • rationale: str - Explanation of the score

gate (required)

Pass/fail criteria for the evaluation.

metric_key

Which grader to evaluate. If there is only one grader, this can be omitted.

Type: string

Example:

gate:
  metric_key: accuracy  # Must match a key in graders

metric

Which aggregate to compare: avg_score or accuracy.

Type: string

Default: avg_score

Example:

gate:
  metric: avg_score
  # or
  metric: accuracy

op

Comparison operator for the aggregate metric.

Type: string (one of: gte, gt, lte, lt, eq)

Example:

gate:
  op: gte  # Greater than or equal

value

Threshold value for comparison.

Type: float (0.0 to 1.0)

Example:

gate:
  value: 0.8  # Require >= 0.8

pass_op

Comparison operator for per-sample pass criteria.

Type: string (one of: gte, gt, lte, lt, eq)

Default: Same as op

Example:

gate:
  metric: accuracy
  pass_op: gte     # Sample passes if...
  pass_value: 0.7  # ...score >= 0.7

pass_value

Threshold for per-sample pass.

Type: float (0.0 to 1.0)

Default: Same as value (or 1.0 for accuracy metric)

Example:

gate:
  metric: accuracy
  op: gte
  value: 0.8       # 80% must pass
  pass_op: gte
  pass_value: 0.7  # Sample passes if score >= 0.7
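
Worked example: with this gate, each sample scoring >= 0.7 counts as a pass; if 9 of 10 samples pass, accuracy is 9/10 = 0.9, which satisfies value: 0.8.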

Complete examples

A minimal suite:

name: basic-eval
dataset: dataset.jsonl
target:
  kind: agent
  agent_file: agent.af
graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant
gate:
  op: gte
  value: 0.8

A suite with multiple graders:

name: comprehensive-eval
description: Tests accuracy and quality
dataset: test_data.jsonl
max_samples: 100
target:
  kind: agent
  agent_file: agent.af
  base_url: http://localhost:8283
graders:
  accuracy:
    display_name: "Answer Accuracy"
    kind: tool
    function: contains
    extractor: last_assistant
  quality:
    display_name: "Response Quality"
    kind: rubric
    prompt_path: rubrics/quality.txt
    model: gpt-4o-mini
    temperature: 0.0
    extractor: last_assistant
gate:
  metric_key: accuracy
  metric: avg_score
  op: gte
  value: 0.85

An advanced multi-model, multi-metric suite:

name: advanced-eval
description: Multi-model, multi-metric evaluation
dataset: comprehensive_tests.jsonl
sample_tags: [production]
setup_script: setup.py:prepare
target:
  kind: agent
  agent_script: factory.py:CustomFactory
  base_url: https://api.letta.com
  api_key: ${LETTA_API_KEY}
  project_id: proj_abc123
  model_configs: [gpt-4o-mini, claude-3-5-sonnet]
graders:
  answer:
    kind: tool
    function: exact_match
    extractor: last_assistant
  tool_usage:
    kind: tool
    function: contains
    extractor: tool_arguments
    extractor_config:
      tool_name: search
  memory:
    kind: tool
    function: contains
    extractor: memory_block
    extractor_config:
      block_label: human
gate:
  metric_key: answer
  metric: accuracy
  op: gte
  value: 0.9
  pass_op: gte
  pass_value: 1.0

Validation

Validate your suite before running:

letta-evals validate suite.yaml

See also:

  • Targets - Understanding agent sources and configuration
  • Graders - Tool graders vs rubric graders
  • Extractors - What to extract from agent responses
  • Gates - Setting pass/fail criteria