---
title: Suite YAML reference | Letta Docs
---

Complete reference for suite configuration files.

A **suite** is a YAML file that defines an evaluation: what agent to test, what dataset to use, how to grade responses, and what criteria determine pass/fail. This is your evaluation specification.

**Quick overview:**

- **name**: Identifier for your evaluation
- **dataset**: JSONL file with test cases
- **target**: Which agent to evaluate (via file, ID, or script)
- **graders**: How to score responses (tool or rubric graders)
- **gate**: Pass/fail criteria

See [Getting Started](/guides/evals/getting-started/index.md) for a tutorial, or [Core Concepts](/guides/evals/concepts/suites/index.md) for a conceptual overview.

## File Structure

```
name: string (required)
description: string (optional)
dataset: path (required)
max_samples: integer (optional)
sample_tags: array (optional)
num_runs: integer (optional)
setup_script: string (optional)


target: object (required)
  kind: "agent"
  base_url: string
  api_key: string
  timeout: float
  project_id: string
  agent_id: string (one of: agent_id, agent_file, agent_script)
  agent_file: path
  agent_script: string
  model_configs: array
  model_handles: array


graders: object (required)
  <metric_key>: object
    kind: "tool" | "rubric"
    display_name: string
    extractor: string
    extractor_config: object
    # Tool grader fields
    function: string
    # Rubric grader fields (LLM API)
    prompt: string
    prompt_path: path
    model: string
    temperature: float
    provider: string
    max_retries: integer
    timeout: float
    rubric_vars: array
    # Rubric grader fields (agent-as-judge)
    agent_file: path
    judge_tool_name: string


gate: object (required)
  kind: "simple" | "logical" | "weighted_average"
  metric_key: string
  aggregation: "avg_score" | "accuracy"
  op: "gte" | "gt" | "lte" | "lt" | "eq"
  value: float
  pass_threshold: float
  # For logical gates:
  operator: "and" | "or"
  conditions: array
  # For weighted_average gates:
  weights: object
```

## Top-Level Fields

### name (required)

Suite name, used in output and results.

**Type**: string

```
name: question-answering-eval
```

### description (optional)

Human-readable description of what the suite tests.

**Type**: string

```
description: Tests agent's ability to answer factual questions accurately
```

### dataset (required)

Path to JSONL dataset file. Relative paths are resolved from the suite YAML location.

**Type**: path (string)

```
dataset: ./datasets/qa.jsonl
# or an absolute path:
# dataset: /absolute/path/to/dataset.jsonl
```

### max\_samples (optional)

Limit the number of samples to evaluate. Useful for quick tests.

**Type**: integer | **Default**: All samples

```
max_samples: 10 # Only evaluate first 10 samples
```

### sample\_tags (optional)

Filter samples by tags. Only samples with ALL specified tags are evaluated.

**Type**: array of strings

```
sample_tags: [math, easy] # Only samples tagged with both
```

### num\_runs (optional)

Number of times to run the evaluation suite.

**Type**: integer | **Default**: 1

```
num_runs: 5 # Run the evaluation 5 times
```

### setup\_script (optional)

Path to Python script with setup function.

**Type**: string (format: `path/to/script.py:function_name`)

```
setup_script: setup.py:prepare_environment
```
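
Taken together, the top-level fields above might be combined as in the sketch below. The file names, tag values, and the `prepare_environment` function are illustrative placeholders rather than names the tool defines, and the required `target`, `graders`, and `gate` sections are omitted for brevity.

```yaml
# Top-level fields only; target, graders, and gate still need to be added.
name: question-answering-eval
description: Tests agent's ability to answer factual questions accurately
dataset: ./datasets/qa.jsonl                # resolved relative to this YAML file
max_samples: 10                             # evaluate only the first 10 samples
sample_tags: [math, easy]                   # keep samples carrying both tags
num_runs: 5                                 # run the evaluation 5 times
setup_script: setup.py:prepare_environment  # path/to/script.py:function_name
```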

## target (required)

Configuration for the agent being evaluated.

### kind (required)

Type of target. Currently only `"agent"` is supported.

```
target:
  kind: agent
```

### base\_url (optional)

Letta server URL. **Default**: `https://api.letta.com`

```
target:
  base_url: https://api.letta.com
  # or, for a local server:
  # base_url: http://localhost:8283
```

### api\_key (optional)

API key for Letta authentication. Can also be set via `LETTA_API_KEY` environment variable.

```
target:
  api_key: your-api-key-here
```

### timeout (optional)

Request timeout in seconds. **Default**: 300.0

```
target:
  timeout: 600.0 # 10 minutes
```

### Agent Source (required, pick one)

Exactly one of these must be specified:

#### agent\_id

ID of existing agent on the server.

```
target:
  agent_id: agent-123-abc
```

#### agent\_file

Path to `.af` agent file.

```
target:
  agent_file: ./agents/my_agent.af
```

#### agent\_script

Path to Python script with agent factory.

```
target:
  agent_script: factory.py:MyAgentFactory
```

See [Targets](/guides/evals/concepts/targets/index.md) for details on agent sources.

### model\_configs (optional)

List of model configuration names to test. Cannot be used with `model_handles`.

```
target:
  model_configs: [gpt-4o-mini, claude-3-5-sonnet]
```

### model\_handles (optional)

List of model handles for cloud deployments. Cannot be used with `model_configs`.

```
target:
  model_handles: ["openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"]
```
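
As a sketch of how the target fields fit together, the block below loads the agent from a file and lists model handles to test. The file path, timeout, and handles are illustrative values reused from the examples above; whether a given deployment accepts these particular handles is an assumption to verify.

```yaml
target:
  kind: agent
  # base_url is omitted, so the default https://api.letta.com applies.
  # api_key is omitted; it can be supplied via the LETTA_API_KEY environment variable.
  timeout: 600.0                      # seconds; the default is 300.0
  agent_file: ./agents/my_agent.af    # exactly one of agent_id, agent_file, agent_script
  model_handles:                      # cannot be combined with model_configs
    - openai/gpt-4o-mini
    - anthropic/claude-3-5-sonnet
```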

## graders (required)

One or more graders, each under a unique key. That key is the metric key a gate refers to through `metric_key`.

### kind (required)

Grader type: `"tool"` or `"rubric"`.

```
graders:
  my_metric:
    kind: tool
```

### extractor (required)

Name of the extractor to use.

```
graders:
  my_metric:
    extractor: last_assistant
```

### Tool Grader Fields

#### function (required for tool graders)

Name of the grading function.

```
graders:
  accuracy:
    kind: tool
    function: exact_match
```

### Rubric Grader Fields

#### prompt or prompt\_path (required)

Inline rubric prompt or path to rubric file.

```
graders:
  quality:
    kind: rubric
    prompt: |
      Evaluate response quality from 0.0 to 1.0.
```

#### model (optional)

LLM model for judging. **Default**: `gpt-4o-mini`

```
graders:
  quality:
    kind: rubric
    model: gpt-4o
```

#### temperature (optional)

Temperature for LLM generation. **Default**: 0.0

```
graders:
  quality:
    kind: rubric
    temperature: 0.0
```

#### agent\_file (agent-as-judge)

Path to `.af` agent file to use as judge.

```
graders:
  agent_judge:
    kind: rubric
    agent_file: judge.af
    prompt_path: rubric.txt
```
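
The grader fields above can be combined into a single `graders` block. The sketch below pairs a tool grader with an LLM-judged rubric grader; the metric keys `accuracy` and `quality` and the rubric wording are illustrative choices, not required names.

```yaml
graders:
  accuracy:                      # metric key for this grader
    kind: tool
    function: exact_match
    extractor: last_assistant
  quality:
    kind: rubric
    prompt: |
      Evaluate response quality from 0.0 to 1.0.
    model: gpt-4o-mini           # the documented default, written out explicitly
    temperature: 0.0
    extractor: last_assistant
```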

## gate (required)

Pass/fail criteria for the evaluation.

### kind (required)

Type of gate: `simple`, `logical`, or `weighted_average`.

```
gate:
  kind: simple
```
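
The file structure above lists `operator` and `conditions` for logical gates, but this page does not spell out the shape of a condition. The sketch below assumes each condition repeats the simple-gate fields (`metric_key`, `aggregation`, `op`, `value`); treat that layout as an assumption and confirm it with `letta-evals validate` or the Gates guide before relying on it.

```yaml
# Assumed layout: each condition uses the same fields as a simple gate.
gate:
  kind: logical
  operator: and              # "and" | "or", per the file structure above
  conditions:
    - metric_key: accuracy   # hypothetical grader keys
      aggregation: avg_score
      op: gte
      value: 0.8
    - metric_key: quality
      aggregation: avg_score
      op: gte
      value: 0.7
```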

### metric\_key (required for simple gates)

Key of the grader (as defined under `graders`) whose metric the gate checks.

```
gate:
  kind: simple
  metric_key: accuracy
```

### aggregation (required)

Which aggregate to compare: `avg_score` or `accuracy`.

```
gate:
  kind: simple
  aggregation: avg_score
```

### op (required)

Comparison operator: `gte`, `gt`, `lte`, `lt`, `eq`

```
gate:
  kind: simple
  op: gte # Greater than or equal
```

### value (required)

Threshold value for comparison (0.0 to 1.0).

```
gate:
  kind: simple
  value: 0.8 # Require >= 0.8
```

### pass\_threshold (optional)

Per-sample score threshold, used when `aggregation` is `accuracy`. A sample counts as passing when its score is at least this value.

```
gate:
  kind: simple
  aggregation: accuracy
  pass_threshold: 0.7 # Sample passes if score >= 0.7
```
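
Putting the gate fields together, the sketch below passes the suite when at least 80% of samples score 0.7 or higher. It assumes `accuracy` is the fraction of samples meeting `pass_threshold`, expressed on the same 0.0 to 1.0 scale as `value`; the grader key `accuracy` is illustrative.

```yaml
gate:
  kind: simple
  metric_key: accuracy      # the grader key to check
  aggregation: accuracy
  pass_threshold: 0.7       # a sample passes if its score >= 0.7
  op: gte
  value: 0.8                # assumed scale: fraction of passing samples
# Worked example under that assumption: scores 0.9, 0.6, 0.8, 0.75, 1.0
# -> 4 of 5 samples pass -> accuracy 0.8 -> 0.8 >= 0.8, so the gate passes.
```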

## Complete Examples

### Minimal Suite

```
name: basic-eval
dataset: dataset.jsonl


target:
  kind: agent
  agent_file: agent.af
  base_url: http://localhost:8283


graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant


gate:
  kind: simple
  metric_key: accuracy
  aggregation: avg_score
  op: gte
  value: 0.8
```

### Multi-Metric Suite

```
name: comprehensive-eval
description: Tests accuracy and quality
dataset: test_data.jsonl


target:
  kind: agent
  agent_file: agent.af
  base_url: http://localhost:8283


graders:
  accuracy:
    kind: tool
    function: contains
    extractor: last_assistant


  quality:
    kind: rubric
    prompt_path: rubrics/quality.txt
    model: gpt-4o-mini
    extractor: last_assistant


gate:
  kind: simple
  metric_key: accuracy
  aggregation: avg_score
  op: gte
  value: 0.85
```

## Validation

Validate your suite before running:

```
letta-evals validate suite.yaml
```

## Next Steps

- [Targets](/guides/evals/concepts/targets/index.md) - Understanding agent sources and configuration
- [Graders](/guides/evals/concepts/graders/index.md) - Tool graders vs rubric graders
- [Extractors](/guides/evals/concepts/extractors/index.md) - What to extract from agent responses
- [Gates](/guides/evals/concepts/gates/index.md) - Setting pass/fail criteria