# Understanding results
Understand evaluation results, metrics, and how to analyze agent performance data.
This guide explains how to interpret evaluation results.
## Result Structure

An evaluation produces three types of output:
- Console output: Real-time progress and summary
- Summary JSON: Aggregate metrics and configuration
- Results JSONL: Per-sample detailed results
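For programmatic analysis, the JSON outputs can be loaded directly once an evaluation has been saved to disk. A minimal sketch, assuming the `results/` layout produced by `--output results/` (described under Saving Results below):

```python
import json

# Aggregate metrics, suite config, and the gate result
with open("results/summary.json") as f:
    summary = json.load(f)

# Per-sample results: one JSON object per line
with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

print(f"Gate passed: {summary['gates_passed']}")
print(f"Samples loaded: {len(results)}")
```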
## Console Output

### Progress Display
```
Running evaluation: my-eval-suite
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 100%

Results:
  Total samples: 3
  Attempted: 3
  Avg score: 0.83 (attempted: 0.83)
  Passed: 2 (66.7%)

Gate (quality >= 0.75): PASSED
```

### Quiet Mode
```bash
letta-evals run suite.yaml --quiet
```

Output:

```
✓ PASSED
```

or

```
✗ FAILED
```

## JSON Output
### Saving Results

```bash
letta-evals run suite.yaml --output results/
```

Creates three files:
### header.json

Evaluation metadata:

```json
{
  "suite_name": "my-eval-suite",
  "timestamp": "2025-01-15T10:30:00Z",
  "version": "0.3.0"
}
```

### summary.json
Complete evaluation summary:

```json
{
  "suite": "my-eval-suite",
  "config": {
    "target": {...},
    "graders": {...},
    "gate": {...}
  },
  "metrics": {
    "total": 10,
    "total_attempted": 10,
    "avg_score_attempted": 0.85,
    "avg_score_total": 0.85,
    "passed_attempts": 8,
    "failed_attempts": 2,
    "by_metric": {
      "accuracy": {
        "avg_score_attempted": 0.90,
        "pass_rate": 90.0,
        "passed_attempts": 9,
        "failed_attempts": 1
      },
      "quality": {
        "avg_score_attempted": 0.80,
        "pass_rate": 70.0,
        "passed_attempts": 7,
        "failed_attempts": 3
      }
    },
    "cost": {
      "total_cost": 0.0234,
      "total_prompt_tokens": 15000,
      "total_completion_tokens": 3000
    }
  },
  "gates_passed": true
}
```

### results.jsonl
One JSON object per line, each representing one sample:

```json
{"sample": {"id": 0, "input": "What is 2+2?", "ground_truth": "4"}, "submission": "4", "grade": {"score": 1.0, "rationale": "Exact match: true"}, "trajectory": [...], "agent_id": "agent-123", "model_name": "default", "cost": 0.0012, "prompt_tokens": 500, "completion_tokens": 50}
{"sample": {"id": 1, "input": "What is 3+3?", "ground_truth": "6"}, "submission": "6", "grade": {"score": 1.0, "rationale": "Exact match: true"}, "trajectory": [...], "agent_id": "agent-124", "model_name": "default", "cost": 0.0011, "prompt_tokens": 480, "completion_tokens": 45}
```

## Metrics Explained
### total

Total number of samples in the evaluation (including errors).
### total_attempted

Number of samples that completed without errors.

If a sample fails during agent execution or grading, it's counted in `total` but not `total_attempted`.
### avg_score_attempted

Average score across samples that completed successfully.

Formula: `sum(scores) / total_attempted`

Range: 0.0 to 1.0
### avg_score_total

Average score across all samples, treating errors as 0.0.

Formula: `sum(scores) / total`

Range: 0.0 to 1.0
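To make the difference between the two averages concrete, here is a small sketch that recomputes both from a list of per-sample scores, where `None` stands in for an errored sample (purely illustrative data):

```python
# Illustrative per-sample scores; None marks a sample that errored
scores = [1.0, 0.5, None, 1.0]

total = len(scores)
attempted = [s for s in scores if s is not None]
total_attempted = len(attempted)

# avg_score_attempted: errored samples are excluded entirely
avg_score_attempted = sum(attempted) / total_attempted  # 0.833...

# avg_score_total: errored samples count as 0.0 but stay in the denominator
avg_score_total = sum(s if s is not None else 0.0 for s in scores) / total  # 0.625

print(total, total_attempted, avg_score_attempted, avg_score_total)
```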
### passed_attempts / failed_attempts

Number of samples that passed/failed the gate's per-sample criteria.

By default:

- If gate metric is `accuracy`: sample passes if score >= 1.0
- If gate metric is `avg_score`: sample passes if score >= gate value

Can be customized with `pass_op` and `pass_value` in gate config.
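As an illustration of the default `accuracy`-style criterion (a perfect score is required), this sketch recounts passes and failures from results.jsonl; swap the comparison for your own gate value if you use `avg_score` or a custom `pass_op`/`pass_value`:

```python
import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

# Default accuracy criterion: a sample passes only with a perfect score
passed_attempts = sum(1 for r in results if r["grade"]["score"] >= 1.0)
failed_attempts = len(results) - passed_attempts

print(f"passed: {passed_attempts}, failed: {failed_attempts}")
```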
### by_metric

For multi-metric evaluation, shows aggregate stats for each metric:

```json
"by_metric": {
  "accuracy": {
    "avg_score_attempted": 0.90,
    "avg_score_total": 0.85,
    "pass_rate": 90.0,
    "passed_attempts": 9,
    "failed_attempts": 1
  }
}
```

### cost (aggregate)
Cost and token usage metrics across all samples:

```json
"cost": {
  "total_cost": 0.0234,
  "total_prompt_tokens": 15000,
  "total_completion_tokens": 3000
}
```

Cost tracking is automatic for supported models including:
- OpenAI: GPT-4.1, GPT-4.1-mini, GPT-5, GPT-5-mini, GPT-5.1
- Anthropic: Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku 4.5
- Google: Gemini 3 Pro
- DeepSeek, Kimi, and more
Returns `null` if model pricing is not available.
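Because each line of results.jsonl also carries per-sample `cost`, `prompt_tokens`, and `completion_tokens`, the aggregate numbers can be cross-checked by hand. A sketch that skips samples without pricing data:

```python
import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

# Sum per-sample usage; cost is null when no pricing data is available
total_cost = sum(r["cost"] for r in results if r.get("cost") is not None)
total_prompt_tokens = sum(r.get("prompt_tokens", 0) for r in results)
total_completion_tokens = sum(r.get("completion_tokens", 0) for r in results)

print(f"${total_cost:.4f} total, "
      f"{total_prompt_tokens + total_completion_tokens} tokens")
```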
## Sample Results

Each sample result includes:
### sample

The original dataset sample:

```json
"sample": {
  "id": 0,
  "input": "What is 2+2?",
  "ground_truth": "4",
  "metadata": {...}
}
```

### submission
The extracted text that was graded:

```json
"submission": "The answer is 4"
```

### grade

The grading result:
"grade": { "score": 1.0, "rationale": "Contains ground_truth: true", "metadata": {"model": "gpt-4o-mini", "usage": {...}}}grades (multi-metric)
Section titled “grades (multi-metric)”For multi-metric evaluation:
"grades": { "accuracy": {"score": 1.0, "rationale": "Exact match"}, "quality": {"score": 0.85, "rationale": "Good but verbose"}}trajectory
Section titled “trajectory”The complete conversation history:
"trajectory": [ [ {"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "The answer is 4"} ]]agent_id
### agent_id

The ID of the agent that generated this response:

```json
"agent_id": "agent-abc-123"
```

### model_name
The model configuration used:

```json
"model_name": "gpt-4o-mini"
```

### agent_usage
Token usage statistics (if available):

```json
"agent_usage": [
  {"completion_tokens": 10, "prompt_tokens": 50, "total_tokens": 60}
]
```

### cost

Cost in dollars for this sample (if model pricing is available):
"cost": 0.00234prompt_tokens / completion_tokens
Section titled “prompt_tokens / completion_tokens”Token counts for this sample:
"prompt_tokens": 1500,"completion_tokens": 300Interpreting Scores
### Score Ranges
Section titled “Score Ranges”- 1.0: Perfect - fully meets criteria
- 0.8-0.99: Very good - minor issues
- 0.6-0.79: Good - notable improvements possible
- 0.4-0.59: Acceptable - significant issues
- 0.2-0.39: Poor - major problems
- 0.0-0.19: Failed - did not meet criteria
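To see how a run distributes across these bands, you can bucket the per-sample scores from results.jsonl; a small sketch using the ranges above:

```python
import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

# (lower bound, label) pairs matching the bands above, checked top-down
bands = [(1.0, "perfect"), (0.8, "very good"), (0.6, "good"),
         (0.4, "acceptable"), (0.2, "poor"), (0.0, "failed")]

counts = {label: 0 for _, label in bands}
for r in results:
    score = r["grade"]["score"]
    for lower, label in bands:
        if score >= lower:
            counts[label] += 1
            break

for label, count in counts.items():
    print(f"{label}: {count}")
```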
### Binary vs Continuous

Tool graders typically return binary scores:
- 1.0: Passed
- 0.0: Failed
Rubric graders return continuous scores:
- Any value from 0.0 to 1.0
- Allows for partial credit
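A practical consequence: with rubric graders it is worth checking how often partial credit is actually awarded. A quick sketch that splits scores into perfect, partial, and zero:

```python
import json

with open("results/results.jsonl") as f:
    scores = [json.loads(line)["grade"]["score"] for line in f]

# Partial credit (0 < score < 1) only appears with continuous graders
perfect = sum(1 for s in scores if s >= 1.0)
partial = sum(1 for s in scores if 0.0 < s < 1.0)
zero = sum(1 for s in scores if s <= 0.0)

print(f"perfect: {perfect}, partial: {partial}, zero: {zero}")
```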
## Multi-Model Results

When testing multiple models:

```json
"metrics": {
  "per_model": [
    {
      "model_name": "gpt-4o-mini",
      "avg_score_attempted": 0.85,
      "passed_samples": 8,
      "failed_samples": 2,
      "cost": {
        "total_cost": 0.0089,
        "total_prompt_tokens": 8000,
        "total_completion_tokens": 1500
      }
    },
    {
      "model_name": "claude-3-5-sonnet",
      "avg_score_attempted": 0.90,
      "passed_samples": 9,
      "failed_samples": 1,
      "cost": {
        "total_cost": 0.0145,
        "total_prompt_tokens": 7500,
        "total_completion_tokens": 1400
      }
    }
  ]
}
```

Console output:
```
Results by model:
  gpt-4o-mini        - Avg: 0.85, Pass: 80.0%
  claude-3-5-sonnet  - Avg: 0.90, Pass: 90.0%
```
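To compare models programmatically rather than reading the console table, read `per_model` from summary.json; a sketch assuming the structure shown above:

```python
import json

with open("results/summary.json") as f:
    summary = json.load(f)

# Rank models by average score over attempted samples
per_model = summary["metrics"]["per_model"]
for entry in sorted(per_model, key=lambda m: m["avg_score_attempted"], reverse=True):
    print(f"{entry['model_name']}: avg {entry['avg_score_attempted']:.2f}, "
          f"cost ${entry['cost']['total_cost']:.4f}")
```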
## Multiple Runs Statistics

Run evaluations multiple times to measure consistency and get aggregate statistics.
### Configuration

Specify in YAML:

```yaml
name: my-eval-suite
dataset: dataset.jsonl
num_runs: 5  # Run 5 times
target:
  kind: agent
  agent_file: my_agent.af
graders:
  accuracy:
    kind: tool
    function: exact_match
gate:
  metric_key: accuracy
  op: gte
  value: 0.8
```

Or via CLI:
```bash
letta-evals run suite.yaml --num-runs 10 --output results/
```

### Output Structure
Section titled “Output Structure”results/├── run_1/│ ├── header.json│ ├── results.jsonl│ └── summary.json├── run_2/│ ├── header.json│ ├── results.jsonl│ └── summary.json├── ...└── aggregate_stats.json # Statistics across all runsAggregate Statistics File
### Aggregate Statistics File

The `aggregate_stats.json` file includes statistics across all runs:

```json
{
  "num_runs": 10,
  "runs_passed": 8,
  "mean_avg_score_attempted": 0.847,
  "std_avg_score_attempted": 0.042,
  "mean_avg_score_total": 0.847,
  "std_avg_score_total": 0.042,
  "mean_scores": {
    "accuracy": 0.89,
    "quality": 0.82
  },
  "std_scores": {
    "accuracy": 0.035,
    "quality": 0.051
  },
  "individual_run_metrics": [
    {
      "avg_score_attempted": 0.85,
      "avg_score_total": 0.85,
      "pass_rate": 0.85,
      "by_metric": {
        "accuracy": {
          "avg_score_attempted": 0.9,
          "avg_score_total": 0.9,
          "pass_rate": 0.9
        }
      }
    }
    // ... metrics from runs 2-10
  ]
}
```

Key fields:

- `num_runs`: Total number of runs executed
- `runs_passed`: Number of runs that passed the gate
- `mean_avg_score_attempted`: Mean score across runs (only attempted samples)
- `std_avg_score_attempted`: Standard deviation (measures consistency)
- `mean_scores`: Mean for each metric (e.g., `{"accuracy": 0.89}`)
- `std_scores`: Standard deviation for each metric (e.g., `{"accuracy": 0.035}`)
- `individual_run_metrics`: Full metrics object from each individual run
### Use Cases

Measure consistency of non-deterministic agents:

```bash
letta-evals run suite.yaml --num-runs 20 --output results/
# Check std_avg_score_attempted in aggregate_stats.json
# Low std = consistent, high std = variable
```

Get confidence intervals:
```python
import json
import math

with open("results/aggregate_stats.json") as f:
    stats = json.load(f)

mean = stats["mean_avg_score_attempted"]
std = stats["std_avg_score_attempted"]
n = stats["num_runs"]

# 95% confidence interval (assuming normal distribution)
margin = 1.96 * (std / math.sqrt(n))
print(f"Score: {mean:.3f} ± {margin:.3f}")
```

Compare metric consistency:
```python
import json

with open("results/aggregate_stats.json") as f:
    stats = json.load(f)

for metric_name, mean in stats["mean_scores"].items():
    std = stats["std_scores"][metric_name]
    consistency = "consistent" if std < 0.05 else "variable"
    print(f"{metric_name}: {mean:.3f} ± {std:.3f} ({consistency})")
```

## Error Handling
If a sample encounters an error:

```json
{
  "sample": {...},
  "submission": "",
  "grade": {
    "score": 0.0,
    "rationale": "Error during grading: Connection timeout",
    "metadata": {"error": "timeout", "error_type": "ConnectionError"}
  }
}
```

Errors:

- Count toward `total` but not `total_attempted`
- Get a score of 0.0
- Include error details in the rationale and metadata
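To pull out just the errored samples for debugging, filter on the grade metadata shown above; a sketch assuming errors are flagged with an `error` key as in the example:

```python
import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

# Errored samples carry error details in the grade metadata (see example above)
errored = [r for r in results if "error" in r["grade"].get("metadata", {})]

for r in errored:
    print(f"Sample {r['sample']['id']}: {r['grade']['rationale']}")
```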
## Analyzing Results

### Find Low Scores
Section titled “Find Low Scores”import json
with open("results/results.jsonl") as f: results = [json.loads(line) for line in f]
low_scores = [r for r in results if r["grade"]["score"] < 0.5]print(f"Found {len(low_scores)} samples with score < 0.5")
for result in low_scores: print(f"Sample {result['sample']['id']}: {result['grade']['rationale']}")Compare Metrics
Section titled “Compare Metrics”# Load summarywith open("results/summary.json") as f: summary = json.load(f)
metrics = summary["metrics"]["by_metric"]for name, stats in metrics.items(): print(f"{name}: {stats['avg_score_attempted']:.2f} avg, {stats['pass_rate']:.1f}% pass")Extract Failures
Section titled “Extract Failures”# Find samples that failed gate criteriafailures = [ r for r in results if not gate_passed(r["grade"]["score"]) # Your gate logic]Next Steps
Section titled “Next Steps”- Gates - Setting pass/fail criteria
- CLI Commands - Running evaluations