Suite YAML Reference

Complete reference for suite configuration files.

A suite is a YAML file that defines an evaluation: what agent to test, what dataset to use, how to grade responses, and what criteria determine pass/fail. This is your evaluation specification.

Quick overview:

  • name: Identifier for your evaluation
  • dataset: JSONL file with test cases
  • target: Which agent to evaluate (via file, ID, or script)
  • graders: How to score responses (tool or rubric graders)
  • gate: Pass/fail criteria

See Getting Started for a tutorial, or Core Concepts for a conceptual overview.

File Structure

name: string (required)
description: string (optional)
dataset: path (required)
max_samples: integer (optional)
sample_tags: array (optional)
num_runs: integer (optional)
setup_script: string (optional)

target: object (required)
  kind: "agent"
  base_url: string
  api_key: string
  timeout: float
  project_id: string
  agent_id: string (one of: agent_id, agent_file, agent_script)
  agent_file: path
  agent_script: string
  model_configs: array
  model_handles: array

graders: object (required)
  <metric_key>: object
    kind: "tool" | "rubric"
    display_name: string
    extractor: string
    extractor_config: object
    # Tool grader fields
    function: string
    # Rubric grader fields (LLM API)
    prompt: string
    prompt_path: path
    model: string
    temperature: float
    provider: string
    max_retries: integer
    timeout: float
    rubric_vars: array
    # Rubric grader fields (agent-as-judge)
    agent_file: path
    judge_tool_name: string

gate: object (required)
  metric_key: string
  metric: "avg_score" | "accuracy"
  op: "gte" | "gt" | "lte" | "lt" | "eq"
  value: float
  pass_op: "gte" | "gt" | "lte" | "lt" | "eq"
  pass_value: float

Top-Level Fields

name (required)

Suite name, used in output and results.

Type: string

name: question-answering-eval

description (optional)

Human-readable description of what the suite tests.

Type: string

description: Tests agent's ability to answer factual questions accurately

dataset (required)

Path to JSONL dataset file. Relative paths are resolved from the suite YAML location.

Type: path (string)

dataset: ./datasets/qa.jsonl
dataset: /absolute/path/to/dataset.jsonl

max_samples (optional)

Limit the number of samples to evaluate. Useful for quick tests.

Type: integer | Default: All samples

max_samples: 10  # Only evaluate first 10 samples

sample_tags (optional)

Filter samples by tags. Only samples with ALL specified tags are evaluated.

Type: array of strings

sample_tags: [math, easy]  # Only samples tagged with both

num_runs (optional)

Number of times to run the evaluation suite.

Type: integer | Default: 1

num_runs: 5  # Run the evaluation 5 times

setup_script (optional)

Path to Python script with setup function.

Type: string (format: path/to/script.py:function_name)

setup_script: setup.py:prepare_environment

target (required)

Configuration for the agent being evaluated.

kind (required)

Type of target. Currently only "agent" is supported.

target:
  kind: agent

base_url (optional)

Letta server URL. Default: http://localhost:8283

target:
  base_url: http://localhost:8283
  # or
  base_url: https://api.letta.com

api_key (optional)

API key for Letta authentication. Can also be set via LETTA_API_KEY environment variable.

target:
  api_key: your-api-key-here

timeout (optional)

Request timeout in seconds. Default: 300.0

target:
  timeout: 600.0  # 10 minutes

Agent Source (required, pick one)

Exactly one of these must be specified:

agent_id

ID of existing agent on the server.

target:
  agent_id: agent-123-abc

agent_file

Path to .af agent file.

target:
  agent_file: ./agents/my_agent.af

agent_script

Path to Python script with agent factory.

target:
  agent_script: factory.py:MyAgentFactory

See Targets for details on agent sources.

model_configs (optional)

List of model configuration names to test. Cannot be used with model_handles.

target:
  model_configs: [gpt-4o-mini, claude-3-5-sonnet]

model_handles (optional)

List of model handles for cloud deployments. Cannot be used with model_configs.

target:
  model_handles: ["openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"]
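
Either list lets you evaluate the same suite under multiple models. As a sketch, a target that loads an agent file and sweeps it across two cloud model handles might look like the following (combining model_handles with an agent source is an assumption here, not stated above):

target:
  kind: agent
  agent_file: ./agents/my_agent.af
  # Assumption: model_handles can be combined with any agent source
  model_handles: ["openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"]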

graders (required)

One or more graders, each under a unique metric key of your choosing.
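
The metric key is the name that gate.metric_key refers to. For example, a graders block defining two metrics, accuracy and quality, might look like this (only fields documented below are used):

graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant

  quality:
    kind: rubric
    prompt_path: rubrics/quality.txt
    extractor: last_assistant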

kind (required)

Grader type: "tool" or "rubric".

graders:
  my_metric:
    kind: tool

extractor (required)

Name of the extractor to use.

graders:
  my_metric:
    extractor: last_assistant

Tool Grader Fields

function (required for tool graders)

Name of the grading function.

graders:
  accuracy:
    kind: tool
    function: exact_match

Rubric Grader Fields

prompt or prompt_path (required)

Inline rubric prompt or path to rubric file.

graders:
  quality:
    kind: rubric
    prompt: |
      Evaluate response quality from 0.0 to 1.0.

model (optional)

LLM model for judging. Default: gpt-4o-mini

graders:
  quality:
    kind: rubric
    model: gpt-4o

temperature (optional)

Temperature for LLM generation. Default: 0.0

graders:
  quality:
    kind: rubric
    temperature: 0.0
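
Putting the rubric fields above together, a full LLM-judge grader might look like:

graders:
  quality:
    kind: rubric
    prompt_path: rubrics/quality.txt
    model: gpt-4o
    temperature: 0.0
    extractor: last_assistant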

agent_file (agent-as-judge)

Path to .af agent file to use as judge.

graders:
  agent_judge:
    kind: rubric
    agent_file: judge.af
    prompt_path: rubric.txt

gate (required)

Pass/fail criteria for the evaluation.

metric_key (optional)

Which grader the gate applies to. If the suite defines only one grader, this can be omitted.

gate:
  metric_key: accuracy

metric (optional)

Which aggregate to compare: avg_score or accuracy. Default: avg_score

gate:
  metric: avg_score

op (required)

Comparison operator: gte, gt, lte, lt, eq

gate:
  op: gte  # Greater than or equal

value (required)

Threshold value for comparison (0.0 to 1.0).

gate:
  value: 0.8  # Require >= 0.8
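
Putting the gate fields together, a gate that requires the accuracy grader's average score to be at least 0.8 might look like:

gate:
  metric_key: accuracy
  metric: avg_score
  op: gte
  value: 0.8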

Complete Examples

Minimal Suite

name: basic-eval
dataset: dataset.jsonl

target:
  kind: agent
  agent_file: agent.af

graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant

gate:
  op: gte
  value: 0.8

Multi-Metric Suite

name: comprehensive-eval
description: Tests accuracy and quality
dataset: test_data.jsonl

target:
  kind: agent
  agent_file: agent.af

graders:
  accuracy:
    kind: tool
    function: contains
    extractor: last_assistant

  quality:
    kind: rubric
    prompt_path: rubrics/quality.txt
    model: gpt-4o-mini
    extractor: last_assistant

gate:
  metric_key: accuracy
  op: gte
  value: 0.85

Validation

Validate your suite before running:

$ letta-evals validate suite.yaml

Next Steps

  • Targets - Understanding agent sources and configuration
  • Graders - Tool graders vs rubric graders
  • Extractors - What to extract from agent responses
  • Gates - Setting pass/fail criteria