Suite YAML Reference
Complete reference for suite configuration files.
A suite is a YAML file that defines an evaluation: what agent to test, what dataset to use, how to grade responses, and what criteria determine pass/fail. This is your evaluation specification.
Quick overview:
- name: Identifier for your evaluation
- dataset: JSONL file with test cases
- target: Which agent to evaluate (via file, ID, or script)
- graders: How to score responses (tool or rubric graders)
- gate: Pass/fail criteria
See Getting Started for a tutorial, or Core Concepts for a conceptual overview.
File Structure
```yaml
name: string (required)
description: string (optional)
dataset: path (required)
max_samples: integer (optional)
sample_tags: array (optional)
num_runs: integer (optional)
setup_script: string (optional)

target:                      # object (required)
  kind: "agent"
  base_url: string
  api_key: string
  timeout: float
  project_id: string
  agent_id: string           # one of: agent_id, agent_file, agent_script
  agent_file: path
  agent_script: string
  model_configs: array
  model_handles: array

graders:                     # object (required)
  <metric_key>:              # object
    kind: "tool" | "rubric"
    display_name: string
    extractor: string
    extractor_config: object
    # Tool grader fields
    function: string
    # Rubric grader fields (LLM API)
    prompt: string
    prompt_path: path
    model: string
    temperature: float
    provider: string
    max_retries: integer
    timeout: float
    rubric_vars: array
    # Rubric grader fields (agent-as-judge)
    agent_file: path
    judge_tool_name: string

gate:                        # object (required)
  metric_key: string
  metric: "avg_score" | "accuracy"
  op: "gte" | "gt" | "lte" | "lt" | "eq"
  value: float
  pass_op: "gte" | "gt" | "lte" | "lt" | "eq"
  pass_value: float
```
Top-Level Fields

name (required)
Suite name, used in output and results.
Type: string
Example:
```yaml
name: question-answering-eval
```

description (optional)
Human-readable description of what the suite tests.
Type: string
Example:
```yaml
description: Tests agent's ability to answer factual questions accurately
```

dataset (required)
Path to JSONL dataset file. Relative paths are resolved from the suite YAML location.
Type: path (string)
Example:
```yaml
dataset: ./datasets/qa.jsonl
# or an absolute path:
dataset: /absolute/path/to/dataset.jsonl
```
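Each line of the dataset file is one JSON sample. A minimal sketch (fields as used in the examples throughout this reference; see Datasets for the full schema):

```json
{"input": "What is the capital of France?", "ground_truth": "Paris"}
{"input": "What is 2+2?", "ground_truth": "4", "tags": ["math", "easy"]}
```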
max_samples (optional)
Limit the number of samples to evaluate. Useful for quick tests.
Type: integer
Default: All samples
Example:
```yaml
max_samples: 10  # Only evaluate first 10 samples
```

sample_tags (optional)
Filter samples by tags. Only samples with ALL specified tags are evaluated.
Type: array of strings
Example:
```yaml
sample_tags: [math, easy]  # Only samples tagged with both
```

Dataset samples need tags:

```json
{"input": "What is 2+2?", "ground_truth": "4", "tags": ["math", "easy"]}
```

num_runs (optional)
Number of times to run the evaluation suite. Useful for testing non-deterministic behavior or collecting multiple runs for statistical analysis.
Type: integer
Default: 1
Example:
```yaml
num_runs: 5  # Run the evaluation 5 times
```

setup_script (optional)
Path to Python script with setup function.
Type: string (format: path/to/script.py:function_name)
Example:
```yaml
setup_script: setup.py:prepare_environment
```

The function signature:
```python
def prepare_environment(suite: SuiteSpec) -> None:
    # Setup code
    pass
```
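For example, a setup function might stage fixtures before any samples run (a sketch; the directory and environment variable names here are hypothetical):

```python
import os
from pathlib import Path

def prepare_environment(suite) -> None:
    # suite is the SuiteSpec for this run.
    # Hypothetical setup: create a scratch directory and point the
    # agent's tools at it via an environment variable.
    Path("./scratch").mkdir(exist_ok=True)
    os.environ.setdefault("QA_FIXTURES_DIR", "./scratch")
```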
target (required)
Configuration for the agent being evaluated.
kind (required)
Type of target. Currently only "agent" is supported.
Type: string
Example:
```yaml
target:
  kind: agent
```

base_url (optional)
Letta server URL.
Type: string
Default: http://localhost:8283
Example:
```yaml
target:
  base_url: http://localhost:8283
  # or
  # base_url: https://api.letta.com
```

api_key (optional)
API key for Letta authentication. Can also be set via the LETTA_API_KEY environment variable.
Type: string
Example:
```yaml
target:
  api_key: your-api-key-here
```

timeout (optional)
Request timeout in seconds.
Type: float
Default: 300.0
Example:
```yaml
target:
  timeout: 600.0  # 10 minutes
```

project_id (optional)
Letta project ID (for Letta Cloud).
Type: string
Example:
```yaml
target:
  project_id: proj_abc123
```

Agent Source (required, pick one)
Exactly one of these must be specified:
agent_id
ID of existing agent on the server.
Type: string
Example:
```yaml
target:
  agent_id: agent-123-abc
```

agent_file
Path to .af agent file.
Type: path (string, must end in .af)
Example:
```yaml
target:
  agent_file: ./agents/my_agent.af
```

agent_script
Path to Python script with agent factory.
Type: string (format: path/to/script.py:ClassName)
Example:
```yaml
target:
  agent_script: factory.py:MyAgentFactory
```

See Targets for details on agent sources.
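As a rough sketch of the shape such a factory can take (the method name and arguments below are placeholders, not the actual letta-evals interface; Targets documents the real one):

```python
class MyAgentFactory:
    # Hypothetical sketch; the real factory interface is defined by
    # letta-evals (see Targets). Names below are placeholders.
    def create_agent(self, client, model_config=None):
        # Create or look up the agent under test and return a handle
        # to it, e.g. an agent id.
        ...
```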
model_configs (optional)
List of model configuration names to test. Cannot be used with model_handles.
Type: array of strings
Example:
```yaml
target:
  model_configs: [gpt-4o-mini, claude-3-5-sonnet]
```

model_handles (optional)
List of model handles for cloud deployments. Cannot be used with model_configs.
Type: array of strings
Example:
```yaml
target:
  model_handles: ["openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"]
```

graders (required)
One or more graders, each with a unique key.
Grader Key
The key becomes the metric name:
```yaml
graders:
  accuracy:   # This is the metric_key
    kind: tool
    ...
  quality:    # Another metric_key
    kind: rubric
    ...
```

kind (required)
Grader type: "tool" or "rubric".
Type: string
Example:
```yaml
graders:
  my_metric:
    kind: tool
```

display_name (optional)
Human-friendly name for CLI/UI output.
Type: string
Example:
```yaml
graders:
  acc:
    display_name: "Answer Accuracy"
    kind: tool
    ...
```

extractor (required)
Name of the extractor to use.
Type: string
Example:
```yaml
graders:
  my_metric:
    extractor: last_assistant
```

extractor_config (optional)
Configuration passed to the extractor.
Type: object
Example:
```yaml
graders:
  my_metric:
    extractor: pattern
    extractor_config:
      pattern: "Answer: (.*)"
      group: 1
```
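The pattern extractor applies a regular expression to the extracted text; the equivalent operation in plain Python (a sketch of the semantics, not the framework's code):

```python
import re

# pattern: "Answer: (.*)", group: 1
match = re.search(r"Answer: (.*)", "Let me think... Answer: 4")
submission = match.group(1) if match else None  # -> "4"
```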
Tool Grader Fields

function (required for tool graders)
Name of the grading function.
Type: string
Example:
```yaml
graders:
  accuracy:
    kind: tool
    function: exact_match
```

Rubric Grader Fields

prompt (required if no prompt_path)
Inline rubric prompt.
Type: string
Example:
```yaml
graders:
  quality:
    kind: rubric
    prompt: |
      Evaluate response quality from 0.0 to 1.0.
      Input: {input}
      Response: {submission}
```
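For instance, if a sample's input is "What is 2+2?" and the extractor yields "4", the judge receives a prompt like this (illustrative rendering; {input} and {submission} are filled in automatically):

```text
Evaluate response quality from 0.0 to 1.0.
Input: What is 2+2?
Response: 4
```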
prompt_path (required if no prompt)
Path to rubric file. Cannot use both prompt and prompt_path.
Type: path (string)
Example:
```yaml
graders:
  quality:
    kind: rubric
    prompt_path: rubrics/quality.txt
```

model (optional)
LLM model for judging.
Type: string
Default: gpt-4o-mini
Example:
```yaml
graders:
  quality:
    kind: rubric
    model: gpt-4o
```

temperature (optional)
Temperature for LLM generation.
Type: float (0.0 to 2.0)
Default: 0.0
Example:
```yaml
graders:
  quality:
    kind: rubric
    temperature: 0.0
```

provider (optional)
LLM provider.
Type: string
Default: openai
Example:
```yaml
graders:
  quality:
    kind: rubric
    provider: openai
```

max_retries (optional)
Maximum retry attempts for API calls.
Type: integer
Default: 5
Example:
```yaml
graders:
  quality:
    kind: rubric
    max_retries: 3
```

timeout (optional)
Timeout for API calls in seconds.
Type: float
Default: 120.0
Example:
```yaml
graders:
  quality:
    kind: rubric
    timeout: 60.0
```

rubric_vars (optional)
List of custom variable names that must be provided in the dataset for rubric template substitution. When specified, the grader validates that each sample includes these variables in its rubric_vars field.
Type: array of strings
Example:
```yaml
graders:
  code_quality:
    kind: rubric
    rubric_vars: [reference_code, required_features]  # Require these variables in dataset
    prompt: |
      Compare the submission to this reference:
      {reference_code}
      Required features: {required_features}
```

Dataset samples must provide these variables:

```json
{
  "input": "Write a fibonacci function",
  "rubric_vars": {
    "reference_code": "def fib(n):\n if n <= 1: return n\n return fib(n-1) + fib(n-2)",
    "required_features": "recursion, base case"
  }
}
```

See Datasets - rubric_vars for details.
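With that sample, the substituted prompt the judge sees would read roughly:

```text
Compare the submission to this reference:
def fib(n):
 if n <= 1: return n
 return fib(n-1) + fib(n-2)
Required features: recursion, base case
```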
agent_file (required for agent-as-judge)
Path to .af agent file to use as judge for rubric grading. Use this instead of model when you want a Letta agent to act as the evaluator.
Type: path (string)
Mutually exclusive with: model, temperature, provider, max_retries, timeout
Example:
```yaml
graders:
  agent_judge:
    kind: rubric
    agent_file: judge.af       # Judge agent with submit_grade tool
    prompt_path: rubric.txt    # Evaluation criteria
    extractor: last_assistant
```

Requirements: The judge agent must have a tool with signature submit_grade(score: float, rationale: str). The framework validates this on initialization.
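A judge tool matching that signature could look like the following (only the tool name and the two parameters are fixed by the framework; the docstring and return value are illustrative):

```python
def submit_grade(score: float, rationale: str) -> str:
    """Submit the final grade for the evaluated response.

    Args:
        score: Score between 0.0 and 1.0.
        rationale: Explanation of the score.
    """
    # The grade is carried by the tool call's arguments; the return
    # value here is illustrative.
    return f"Grade recorded: {score:.2f} ({rationale})"
```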
See Rubric Graders - Agent-as-Judge for complete documentation.
judge_tool_name (optional, for agent-as-judge)
Name of the tool that the judge agent uses to submit scores. Only applicable when using agent_file.
Type: string
Default: submit_grade
Example:
```yaml
graders:
  agent_judge:
    kind: rubric
    agent_file: judge.af
    judge_tool_name: submit_grade  # Default, can be omitted
    prompt_path: rubric.txt
    extractor: last_assistant
```

Tool requirements: The tool must have exactly two parameters:
- score: float - Score between 0.0 and 1.0
- rationale: str - Explanation of the score
gate (required)
Pass/fail criteria for the evaluation.
metric_key (optional)
Which grader to evaluate. If only one grader, this can be omitted.
Type: string
Example:
```yaml
gate:
  metric_key: accuracy  # Must match a key in graders
```

metric (optional)
Which aggregate to compare: avg_score (the mean grader score across samples) or accuracy (the fraction of samples that meet the per-sample pass criteria).
Type: string
Default: avg_score
Example:
```yaml
gate:
  metric: avg_score
  # or
  # metric: accuracy
```

op (required)
Comparison operator.
Type: string (one of: gte, gt, lte, lt, eq)
Example:
```yaml
gate:
  op: gte  # Greater than or equal
```

value (required)
Threshold value for comparison.
Type: float (0.0 to 1.0)
Example:
```yaml
gate:
  value: 0.8  # Require >= 0.8
```

pass_op (optional)
Comparison operator for per-sample pass criteria.
Type: string (one of: gte, gt, lte, lt, eq)
Default: Same as op
Example:
```yaml
gate:
  metric: accuracy
  pass_op: gte     # Sample passes if...
  pass_value: 0.7  # ...score >= 0.7
```

pass_value (optional)
Threshold for per-sample pass.
Type: float (0.0 to 1.0)
Default: Same as value (or 1.0 for accuracy metric)
Example:
```yaml
gate:
  metric: accuracy
  op: gte
  value: 0.8       # 80% must pass
  pass_op: gte
  pass_value: 0.7  # Sample passes if score >= 0.7
```
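Conceptually, the gate applies the per-sample threshold first, then the aggregate threshold. A sketch of the logic for the example above (illustrative, not the framework's actual code):

```python
# metric: accuracy with pass_op/pass_value and op/value as above
scores = [0.9, 0.65, 1.0, 0.8, 0.7]   # per-sample grader scores
passed = [s >= 0.7 for s in scores]   # pass_op/pass_value: score >= 0.7
accuracy = sum(passed) / len(passed)  # 4 of 5 samples pass -> 0.8
suite_passes = accuracy >= 0.8        # op/value: 0.8 >= 0.8 -> True
```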
Complete Examples

Minimal Suite
```yaml
name: basic-eval
dataset: dataset.jsonl

target:
  kind: agent
  agent_file: agent.af

graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant

gate:
  op: gte
  value: 0.8
```
Multi-Metric Suite

```yaml
name: comprehensive-eval
description: Tests accuracy and quality
dataset: test_data.jsonl
max_samples: 100

target:
  kind: agent
  agent_file: agent.af
  base_url: http://localhost:8283

graders:
  accuracy:
    display_name: "Answer Accuracy"
    kind: tool
    function: contains
    extractor: last_assistant
  quality:
    display_name: "Response Quality"
    kind: rubric
    prompt_path: rubrics/quality.txt
    model: gpt-4o-mini
    temperature: 0.0
    extractor: last_assistant

gate:
  metric_key: accuracy
  metric: avg_score
  op: gte
  value: 0.85
```
Advanced Suite

```yaml
name: advanced-eval
description: Multi-model, multi-metric evaluation
dataset: comprehensive_tests.jsonl
sample_tags: [production]
setup_script: setup.py:prepare

target:
  kind: agent
  agent_script: factory.py:CustomFactory
  base_url: https://api.letta.com
  api_key: ${LETTA_API_KEY}
  project_id: proj_abc123
  model_configs: [gpt-4o-mini, claude-3-5-sonnet]

graders:
  answer:
    kind: tool
    function: exact_match
    extractor: last_assistant
  tool_usage:
    kind: tool
    function: contains
    extractor: tool_arguments
    extractor_config:
      tool_name: search
  memory:
    kind: tool
    function: contains
    extractor: memory_block
    extractor_config:
      block_label: human

gate:
  metric_key: answer
  metric: accuracy
  op: gte
  value: 0.9
  pass_op: gte
  pass_value: 1.0
```
Validation
Validate your suite before running:

```bash
letta-evals validate suite.yaml
```
Next Steps
- Targets - Understanding agent sources and configuration
- Graders - Tool graders vs rubric graders
- Extractors - What to extract from agent responses
- Gates - Setting pass/fail criteria