Suites

A suite is a YAML configuration file that defines a complete evaluation specification. It’s the central piece that ties together your dataset, target agent, grading criteria, and pass/fail thresholds.

Quick overview:

  • Single file defines everything: Dataset, agent, graders, and success criteria all in one YAML
  • Reusable and shareable: Version control your evaluation specs alongside your code
  • Multi-metric support: Evaluate multiple aspects (accuracy, quality, tool usage) in one run
  • Multi-model testing: Run the same suite across multiple models
  • Flexible filtering: Test subsets using tags or sample limits

Typical workflow:

  1. Create a suite YAML defining what and how to test
  2. Run letta-evals run suite.yaml
  3. Review results showing scores for each metric
  4. Suite passes or fails based on gate criteria

Basic Structure

name: my-evaluation                      # Suite identifier
description: Optional description of what this tests  # Human-readable explanation
dataset: path/to/dataset.jsonl           # Test cases

target:                                  # What agent to evaluate
  kind: agent
  agent_file: agent.af                   # Agent to test
  base_url: https://api.letta.com        # Letta server

graders:                                 # How to evaluate responses
  my_metric:
    kind: tool                           # Deterministic grading
    function: exact_match                # Grading function
    extractor: last_assistant            # What to extract from agent response

gate:                                    # Pass/fail criteria
  metric_key: my_metric                  # Which metric to check
  op: gte                                # Greater than or equal
  value: 0.8                             # 80% threshold

Required Fields

name

The name of your evaluation suite. Used in output and results.

name: question-answering-eval

dataset

Path to the JSONL or CSV dataset file. Can be relative (to the suite YAML) or absolute.

dataset: ./datasets/qa.jsonl  # Relative to suite YAML location
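
Each line of a JSONL dataset is one test case. A minimal sketch, using the same fields as the tagged sample shown under sample_tags below:

{"input": "What is the capital of France?", "ground_truth": "Paris"}
{"input": "What is 2+2?", "ground_truth": "4", "tags": ["math", "easy"]}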

target

Specifies what agent to evaluate. See Targets for details.
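
As the examples later on this page show, a target can load an agent from a file or reference one that already exists by ID. A minimal sketch:

target:
  kind: agent
  agent_file: qa_agent.af          # Load the agent from a file...
  # agent_id: existing-agent-123   # ...or reference an existing agent instead
  base_url: https://api.letta.com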

graders

One or more graders to evaluate agent performance. See Graders for details.

gate

Pass/fail criteria. See Gates for details.
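
A gate names a metric, a comparison operator, and a threshold; the rubric example later on this page additionally sets a metric field (avg_score) to gate on the average score. A sketch mirroring that example:

gate:
  metric_key: quality
  metric: avg_score   # Optional: gate on the average score
  op: gte
  value: 0.7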

Optional Fields

description

A human-readable description of what this suite tests:

description: Tests the agent's ability to answer factual questions accurately

max_samples

Limit the number of samples to evaluate (useful for quick tests):

max_samples: 10  # Only evaluate first 10 samples

sample_tags

Filter samples by tags (only evaluate samples with these tags):

sample_tags: [math, easy]  # Only samples tagged with "math" AND "easy"

Dataset samples can include tags:

1{"input": "What is 2+2?", "ground_truth": "4", "tags": ["math", "easy"]}

num_runs

Number of times to run the entire evaluation suite (useful for testing non-deterministic behavior):

num_runs: 5  # Run the evaluation 5 times

Default: 1

setup_script

Path to a Python script with a setup function to run before evaluation:

setup_script: setup.py:prepare_environment  # script.py:function_name

The setup function should have this signature:

def prepare_environment(suite: SuiteSpec) -> None:
    # Called with the parsed suite spec before evaluation begins,
    # e.g. to create test fixtures or set environment variables
    pass

Path Resolution

Paths in the suite YAML are resolved relative to the YAML file location:

project/
├── suite.yaml
├── dataset.jsonl
└── agents/
└── my_agent.af
# In suite.yaml
dataset: dataset.jsonl            # Resolves to project/dataset.jsonl
target:
  agent_file: agents/my_agent.af  # Resolves to project/agents/my_agent.af

Absolute paths are used as-is.

Multi-Grader Suites

You can evaluate multiple metrics in one suite:

graders:
  accuracy:                    # Check if answer is correct
    kind: tool
    function: exact_match
    extractor: last_assistant

  completeness:                # LLM judges response quality
    kind: rubric
    prompt_path: rubrics/completeness.txt
    model: gpt-4o-mini
    extractor: last_assistant

  tool_usage:                  # Verify correct tool was called
    kind: tool
    function: contains
    extractor: tool_arguments  # Extract tool call arguments

The gate can check any of these metrics:

gate:
  metric_key: accuracy  # Gate on accuracy metric (others still computed)
  op: gte               # Greater than or equal
  value: 0.9            # 90% threshold

Results will include scores for all graders, even if you only gate on one.

Examples

Simple Tool Grader Suite

name: basic-qa                       # Suite name
dataset: questions.jsonl             # Test questions

target:
  kind: agent
  agent_file: qa_agent.af            # Pre-configured agent
  base_url: https://api.letta.com    # Letta Cloud

graders:
  accuracy:                          # Single metric
    kind: tool                       # Deterministic grading
    function: contains               # Check if ground truth is in response
    extractor: last_assistant        # Use final agent message

gate:
  metric_key: accuracy               # Gate on this metric
  op: gte                            # Must be >=
  value: 0.75                        # 75% to pass

Rubric Grader Suite

name: quality-eval                   # Quality evaluation
dataset: prompts.jsonl               # Test prompts

target:
  kind: agent
  agent_id: existing-agent-123       # Use existing agent
  base_url: https://api.letta.com    # Letta Cloud

graders:
  quality:                           # LLM-as-judge metric
    kind: rubric                     # Subjective evaluation
    prompt_path: quality_rubric.txt  # Rubric template
    model: gpt-4o-mini               # Judge model
    temperature: 0.0                 # Deterministic
    extractor: last_assistant        # Evaluate final response

gate:
  metric_key: quality                # Gate on this metric
  metric: avg_score                  # Use average score
  op: gte                            # Must be >=
  value: 0.7                         # 70% to pass

Multi-Model Suite

Test the same agent configuration across different models:

name: model-comparison               # Compare model performance
dataset: test.jsonl                  # Same test for all models

target:
  kind: agent
  agent_file: agent.af               # Same agent configuration
  base_url: https://api.letta.com    # Letta Cloud
  model_configs: [gpt-4o-mini, claude-3-5-sonnet]  # Test both models

graders:
  accuracy:                          # Single metric for comparison
    kind: tool
    function: exact_match
    extractor: last_assistant

gate:
  metric_key: accuracy               # Both models must pass this
  op: gte                            # Must be >=
  value: 0.8                         # 80% threshold

Results will show per-model metrics.

Validation

Validate your suite configuration before running:

$ letta-evals validate suite.yaml

This checks:

  • Required fields are present
  • Paths exist
  • Configuration is valid
  • Grader/extractor combinations are compatible