Suites

A suite is a YAML configuration file that defines a complete evaluation specification. It’s the central piece that ties together your dataset, target agent, grading criteria, and pass/fail thresholds.

Quick overview:

  • Single file defines everything: Dataset, agent, graders, and success criteria all in one YAML
  • Reusable and shareable: Version control your evaluation specs alongside your code
  • Multi-metric support: Evaluate multiple aspects (accuracy, quality, tool usage) in one run
  • Multi-model testing: Run the same suite across multiple models
  • Flexible filtering: Test subsets using tags or sample limits

Typical workflow:

  1. Create a suite YAML defining what and how to test
  2. Run letta-evals run suite.yaml
  3. Review results showing scores for each metric
  4. Suite passes or fails based on gate criteria

Basic Structure

name: my-evaluation                      # Suite identifier
description: Optional description of what this tests  # Human-readable explanation
dataset: path/to/dataset.jsonl           # Test cases

target:                                  # What agent to evaluate
  kind: agent
  agent_file: agent.af                   # Agent to test
  base_url: https://api.letta.com        # Letta server

graders:                                 # How to evaluate responses
  my_metric:
    kind: tool                           # Deterministic grading
    function: exact_match                # Grading function
    extractor: last_assistant            # What to extract from agent response

gate:                                    # Pass/fail criteria
  metric_key: my_metric                  # Which metric to check
  op: gte                                # Greater than or equal
  value: 0.8                             # 80% threshold

Required Fields

name

The name of your evaluation suite. Used in output and results.

name: question-answering-eval

dataset

Path to the JSONL or CSV dataset file. Can be relative (to the suite YAML) or absolute.

dataset: ./datasets/qa.jsonl  # Relative to suite YAML location
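
Each line of a JSONL dataset is one test case. A minimal sketch, using the same fields as the tagged sample shown under sample_tags below:

{"input": "What is the capital of France?", "ground_truth": "Paris"}
{"input": "What is 2+2?", "ground_truth": "4", "tags": ["math", "easy"]}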

target

Specifies what agent to evaluate. See Targets for details.
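
As the examples later on this page show, a target can load an agent from a file or reference one that already exists by ID. A minimal sketch:

target:
  kind: agent
  agent_file: qa_agent.af          # Load the agent from a file...
  # agent_id: existing-agent-123   # ...or reference an existing agent instead
  base_url: https://api.letta.com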

graders

One or more graders to evaluate agent performance. See Graders for details.

gate

Pass/fail criteria. See Gates for details.
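
A gate names a metric, a comparison operator, and a threshold; the rubric example later on this page additionally sets a metric field (avg_score) to gate on the average score. A sketch mirroring that example:

gate:
  metric_key: quality
  metric: avg_score   # Optional: gate on the average score
  op: gte
  value: 0.7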

Optional Fields

description

A human-readable description of what this suite tests:

description: Tests the agent's ability to answer factual questions accurately

max_samples

Limit the number of samples to evaluate (useful for quick tests):

max_samples: 10  # Only evaluate first 10 samples

sample_tags

Filter samples by tags (only evaluate samples with these tags):

sample_tags: [math, easy]  # Only samples tagged with "math" AND "easy"

Dataset samples can include tags:

1{"input": "What is 2+2?", "ground_truth": "4", "tags": ["math", "easy"]}

num_runs

Number of times to run the entire evaluation suite (useful for testing non-deterministic behavior):

num_runs: 5  # Run the evaluation 5 times

Default: 1

setup_script

Path to a Python script with a setup function to run before evaluation:

setup_script: setup.py:prepare_environment  # script.py:function_name

The setup function should have this signature:

def prepare_environment(suite: SuiteSpec) -> None:
    # Called with the parsed suite spec before evaluation begins,
    # e.g. to create test fixtures or set environment variables
    pass

Path Resolution

Paths in the suite YAML are resolved relative to the YAML file location:

project/
├── suite.yaml
├── dataset.jsonl
└── agents/
└── my_agent.af
# In suite.yaml
dataset: dataset.jsonl            # Resolves to project/dataset.jsonl
target:
  agent_file: agents/my_agent.af  # Resolves to project/agents/my_agent.af

Absolute paths are used as-is.

Multi-Grader Suites

You can evaluate multiple metrics in one suite:

graders:
  accuracy:                    # Check if answer is correct
    kind: tool
    function: exact_match
    extractor: last_assistant

  completeness:                # LLM judges response quality
    kind: rubric
    prompt_path: rubrics/completeness.txt
    model: gpt-4o-mini
    extractor: last_assistant

  tool_usage:                  # Verify correct tool was called
    kind: tool
    function: contains
    extractor: tool_arguments  # Extract tool call arguments

The gate can check any of these metrics:

gate:
  metric_key: accuracy  # Gate on accuracy metric (others still computed)
  op: gte               # Greater than or equal
  value: 0.9            # 90% threshold

Results will include scores for all graders, even if you only gate on one.

Examples

Simple Tool Grader Suite

name: basic-qa                       # Suite name
dataset: questions.jsonl             # Test questions

target:
  kind: agent
  agent_file: qa_agent.af            # Pre-configured agent
  base_url: https://api.letta.com    # Letta Cloud

graders:
  accuracy:                          # Single metric
    kind: tool                       # Deterministic grading
    function: contains               # Check if ground truth is in response
    extractor: last_assistant        # Use final agent message

gate:
  metric_key: accuracy               # Gate on this metric
  op: gte                            # Must be >=
  value: 0.75                        # 75% to pass

Rubric Grader Suite

name: quality-eval                   # Quality evaluation
dataset: prompts.jsonl               # Test prompts

target:
  kind: agent
  agent_id: existing-agent-123       # Use existing agent
  base_url: https://api.letta.com    # Letta Cloud

graders:
  quality:                           # LLM-as-judge metric
    kind: rubric                     # Subjective evaluation
    prompt_path: quality_rubric.txt  # Rubric template
    model: gpt-4o-mini               # Judge model
    temperature: 0.0                 # Deterministic
    extractor: last_assistant        # Evaluate final response

gate:
  metric_key: quality                # Gate on this metric
  metric: avg_score                  # Use average score
  op: gte                            # Must be >=
  value: 0.7                         # 70% to pass

Multi-Model Suite

Test the same agent configuration across different models:

name: model-comparison               # Compare model performance
dataset: test.jsonl                  # Same test for all models

target:
  kind: agent
  agent_file: agent.af               # Same agent configuration
  base_url: https://api.letta.com    # Letta Cloud
  model_configs: [gpt-4o-mini, claude-3-5-sonnet]  # Test both models

graders:
  accuracy:                          # Single metric for comparison
    kind: tool
    function: exact_match
    extractor: last_assistant

gate:
  metric_key: accuracy               # Both models must pass this
  op: gte                            # Must be >=
  value: 0.8                         # 80% threshold

Results will show per-model metrics.

Validation

Validate your suite configuration before running:

$ letta-evals validate suite.yaml

This checks:

  • Required fields are present
  • Paths exist
  • Configuration is valid
  • Grader/extractor combinations are compatible