Understanding Results

This guide explains how to interpret evaluation results.

Result Structure

An evaluation produces three types of output:

  1. Console output: Real-time progress and summary
  2. Summary JSON: Aggregate metrics and configuration
  3. Results JSONL: Per-sample detailed results

Console Output

Progress Display

Running evaluation: my-eval-suite
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 100%
Results:
Total samples: 3
Attempted: 3
Avg score: 0.83 (attempted: 0.83)
Passed: 2 (66.7%)
Gate (quality >= 0.75): PASSED

Quiet Mode

$ letta-evals run suite.yaml --quiet

Output:

✓ PASSED

or

✗ FAILED
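
In scripts or CI, you can act on the quiet-mode result programmatically. The sketch below shells out to the CLI and inspects its output; it assumes (not verified here) that the process also exits non-zero when the gate fails, so check the behavior of your installed version.

# A minimal CI sketch: run the suite in quiet mode and fail the job
# when the gate does not pass. Assumes the output shown above; the
# exit-code convention is an assumption, not documented behavior.
import subprocess
import sys

proc = subprocess.run(
    ["letta-evals", "run", "suite.yaml", "--quiet"],
    capture_output=True,
    text=True,
)
print(proc.stdout.strip())
if proc.returncode != 0 or "FAILED" in proc.stdout:
    sys.exit(1)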

JSON Output

Saving Results

$ letta-evals run suite.yaml --output results/

Creates three files:

header.json

Evaluation metadata:

{
  "suite_name": "my-eval-suite",
  "timestamp": "2025-01-15T10:30:00Z",
  "version": "0.3.0"
}

summary.json

Complete evaluation summary:

{
  "suite": "my-eval-suite",
  "config": {
    "target": {...},
    "graders": {...},
    "gate": {...}
  },
  "metrics": {
    "total": 10,
    "total_attempted": 10,
    "avg_score_attempted": 0.85,
    "avg_score_total": 0.85,
    "passed_attempts": 8,
    "failed_attempts": 2,
    "by_metric": {
      "accuracy": {
        "avg_score_attempted": 0.90,
        "pass_rate": 90.0,
        "passed_attempts": 9,
        "failed_attempts": 1
      },
      "quality": {
        "avg_score_attempted": 0.80,
        "pass_rate": 70.0,
        "passed_attempts": 7,
        "failed_attempts": 3
      }
    }
  },
  "gates_passed": true
}
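
The top-level gates_passed flag is convenient for quick programmatic checks. A small sketch, assuming the summary was saved under results/ as shown above:

import json

with open("results/summary.json") as f:
    summary = json.load(f)

print("Gate passed" if summary["gates_passed"] else "Gate failed")
print(f"Average score: {summary['metrics']['avg_score_attempted']:.2f}")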

results.jsonl

One JSON object per line, each representing one sample:

1{"sample": {"id": 0, "input": "What is 2+2?", "ground_truth": "4"}, "submission": "4", "grade": {"score": 1.0, "rationale": "Exact match: true"}, "trajectory": [...], "agent_id": "agent-123", "model_name": "default"}
2{"sample": {"id": 1, "input": "What is 3+3?", "ground_truth": "6"}, "submission": "6", "grade": {"score": 1.0, "rationale": "Exact match: true"}, "trajectory": [...], "agent_id": "agent-124", "model_name": "default"}

Metrics Explained

total

Total number of samples in the evaluation (including errors).

total_attempted

Number of samples that completed without errors.

If a sample fails during agent execution or grading, it’s counted in total but not total_attempted.

avg_score_attempted

Average score across samples that completed successfully.

Formula: sum(scores) / total_attempted

Range: 0.0 to 1.0

avg_score_total

Average score across all samples, treating errors as 0.0.

Formula: sum(scores) / total

Range: 0.0 to 1.0
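
Both averages can be recomputed from results.jsonl. The sketch below assumes errored samples are written with a score of 0.0 and an "error" key in the grade metadata, as described under Error Handling.

import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

scores = [r["grade"]["score"] for r in results]
attempted = [
    r["grade"]["score"]
    for r in results
    if "error" not in (r["grade"].get("metadata") or {})
]

avg_score_total = sum(scores) / len(scores)
avg_score_attempted = sum(attempted) / len(attempted) if attempted else 0.0
print(f"avg_score_total: {avg_score_total:.2f}")
print(f"avg_score_attempted: {avg_score_attempted:.2f}")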

passed_attempts / failed_attempts

Number of samples that passed/failed the gate’s per-sample criteria.

By default:

  • If gate metric is accuracy: sample passes if score >= 1.0
  • If gate metric is avg_score: sample passes if score >= gate value

Can be customized with pass_op and pass_value in gate config.
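
As an illustration of the defaults above, the per-sample check behaves roughly like this. This is a sketch of the described behavior, not the library's implementation; the pass_op/pass_value handling in particular is an assumption.

def sample_passed(score, gate_metric="accuracy", gate_value=0.8,
                  pass_op=None, pass_value=None):
    """Rough model of the per-sample pass criteria described above."""
    if pass_op is not None and pass_value is not None:
        ops = {
            "gte": lambda s, v: s >= v,
            "gt": lambda s, v: s > v,
            "lte": lambda s, v: s <= v,
            "lt": lambda s, v: s < v,
        }
        return ops[pass_op](score, pass_value)
    if gate_metric == "accuracy":
        return score >= 1.0        # default for accuracy gates
    return score >= gate_value     # default for avg_score gates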

by_metric

For multi-metric evaluation, shows aggregate stats for each metric:

1"by_metric": {
2 "accuracy": {
3 "avg_score_attempted": 0.90,
4 "avg_score_total": 0.85,
5 "pass_rate": 90.0,
6 "passed_attempts": 9,
7 "failed_attempts": 1
8 }
9}

Sample Results

Each sample result includes:

sample

The original dataset sample:

1"sample": {
2 "id": 0,
3 "input": "What is 2+2?",
4 "ground_truth": "4",
5 "metadata": {...}
6}

submission

The extracted text that was graded:

1"submission": "The answer is 4"

grade

The grading result:

1"grade": {
2 "score": 1.0,
3 "rationale": "Contains ground_truth: true",
4 "metadata": {"model": "gpt-4o-mini", "usage": {...}}
5}

grades (multi-metric)

For multi-metric evaluation:

1"grades": {
2 "accuracy": {"score": 1.0, "rationale": "Exact match"},
3 "quality": {"score": 0.85, "rationale": "Good but verbose"}
4}

trajectory

The complete conversation history:

1"trajectory": [
2 [
3 {"role": "user", "content": "What is 2+2?"},
4 {"role": "assistant", "content": "The answer is 4"}
5 ]
6]

agent_id

The ID of the agent that generated this response:

1"agent_id": "agent-abc-123"

model_name

The model configuration used:

1"model_name": "gpt-4o-mini"

agent_usage

Token usage statistics (if available):

1"agent_usage": [
2 {"completion_tokens": 10, "prompt_tokens": 50, "total_tokens": 60}
3]
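
To get a rough sense of cost, you can total token usage across samples. A sketch, assuming each agent_usage entry has the shape shown above (the field may be absent for some samples):

import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

total_tokens = sum(
    usage.get("total_tokens", 0)
    for r in results
    for usage in r.get("agent_usage") or []
)
print(f"Total tokens across all samples: {total_tokens}")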

Interpreting Scores

Score Ranges

  • 1.0: Perfect - fully meets criteria
  • 0.8-0.99: Very good - minor issues
  • 0.6-0.79: Good - notable improvements possible
  • 0.4-0.59: Acceptable - significant issues
  • 0.2-0.39: Poor - major problems
  • 0.0-0.19: Failed - did not meet criteria
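
These ranges can also be used to bucket per-sample results into a quick quality distribution. A small sketch over results.jsonl:

import json
from collections import Counter

def bucket(score):
    # Buckets mirror the score ranges listed above
    if score >= 1.0:
        return "perfect"
    if score >= 0.8:
        return "very good"
    if score >= 0.6:
        return "good"
    if score >= 0.4:
        return "acceptable"
    if score >= 0.2:
        return "poor"
    return "failed"

with open("results/results.jsonl") as f:
    counts = Counter(bucket(json.loads(line)["grade"]["score"]) for line in f)

for name, count in counts.most_common():
    print(f"{name}: {count}")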

Binary vs Continuous

Tool graders typically return binary scores:

  • 1.0: Passed
  • 0.0: Failed

Rubric graders return continuous scores:

  • Any value from 0.0 to 1.0
  • Allows for partial credit

Multi-Model Results

When testing multiple models:

1"metrics": {
2 "per_model": [
3 {
4 "model_name": "gpt-4o-mini",
5 "avg_score_attempted": 0.85,
6 "passed_samples": 8,
7 "failed_samples": 2
8 },
9 {
10 "model_name": "claude-3-5-sonnet",
11 "avg_score_attempted": 0.90,
12 "passed_samples": 9,
13 "failed_samples": 1
14 }
15 ]
16}

Console output:

Results by model:
gpt-4o-mini - Avg: 0.85, Pass: 80.0%
claude-3-5-sonnet - Avg: 0.90, Pass: 90.0%
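
The per_model block also makes it easy to compare models programmatically. A sketch, assuming the summary layout shown above:

import json

with open("results/summary.json") as f:
    summary = json.load(f)

per_model = summary["metrics"]["per_model"]
for entry in sorted(per_model, key=lambda m: m["avg_score_attempted"], reverse=True):
    total = entry["passed_samples"] + entry["failed_samples"]
    pass_rate = 100.0 * entry["passed_samples"] / total if total else 0.0
    print(f"{entry['model_name']}: avg {entry['avg_score_attempted']:.2f}, "
          f"pass {pass_rate:.1f}%")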

Multiple Runs Statistics

Run evaluations multiple times to measure consistency and get aggregate statistics.

Configuration

Specify in YAML:

name: my-eval-suite
dataset: dataset.jsonl
num_runs: 5  # Run 5 times
target:
  kind: agent
  agent_file: my_agent.af
graders:
  accuracy:
    kind: tool
    function: exact_match
gate:
  metric_key: accuracy
  op: gte
  value: 0.8

Or via CLI:

$ letta-evals run suite.yaml --num-runs 10 --output results/

Output Structure

results/
├── run_1/
│   ├── header.json
│   ├── results.jsonl
│   └── summary.json
├── run_2/
│   ├── header.json
│   ├── results.jsonl
│   └── summary.json
├── ...
└── aggregate_stats.json   # Statistics across all runs

Aggregate Statistics File

The aggregate_stats.json includes statistics across all runs:

{
  "num_runs": 10,
  "runs_passed": 8,
  "mean_avg_score_attempted": 0.847,
  "std_avg_score_attempted": 0.042,
  "mean_avg_score_total": 0.847,
  "std_avg_score_total": 0.042,
  "mean_scores": {
    "accuracy": 0.89,
    "quality": 0.82
  },
  "std_scores": {
    "accuracy": 0.035,
    "quality": 0.051
  },
  "individual_run_metrics": [
    {
      "avg_score_attempted": 0.85,
      "avg_score_total": 0.85,
      "pass_rate": 0.85,
      "by_metric": {
        "accuracy": {
          "avg_score_attempted": 0.90,
          "avg_score_total": 0.90,
          "pass_rate": 0.90
        }
      }
    }
    // ... metrics from runs 2-10
  ]
}

Key fields:

  • num_runs: Total number of runs executed
  • runs_passed: Number of runs that passed the gate
  • mean_avg_score_attempted: Mean score across runs (only attempted samples)
  • std_avg_score_attempted: Standard deviation (measures consistency)
  • mean_scores: Mean for each metric (e.g., {"accuracy": 0.89})
  • std_scores: Standard deviation for each metric (e.g., {"accuracy": 0.035})
  • individual_run_metrics: Full metrics object from each individual run

Use Cases

Measure consistency of non-deterministic agents:

$ letta-evals run suite.yaml --num-runs 20 --output results/
# Check std_avg_score_attempted in aggregate_stats.json
# Low std = consistent, high std = variable

Get confidence intervals:

import json
import math

with open("results/aggregate_stats.json") as f:
    stats = json.load(f)

mean = stats["mean_avg_score_attempted"]
std = stats["std_avg_score_attempted"]
n = stats["num_runs"]

# 95% confidence interval (assuming normal distribution)
margin = 1.96 * (std / math.sqrt(n))
print(f"Score: {mean:.3f} ± {margin:.3f}")

Compare metric consistency:

import json

with open("results/aggregate_stats.json") as f:
    stats = json.load(f)

for metric_name, mean in stats["mean_scores"].items():
    std = stats["std_scores"][metric_name]
    consistency = "consistent" if std < 0.05 else "variable"
    print(f"{metric_name}: {mean:.3f} ± {std:.3f} ({consistency})")

Error Handling

If a sample encounters an error:

{
  "sample": {...},
  "submission": "",
  "grade": {
    "score": 0.0,
    "rationale": "Error during grading: Connection timeout",
    "metadata": {"error": "timeout", "error_type": "ConnectionError"}
  }
}

Errors:

  • Count toward total but not total_attempted
  • Get score of 0.0
  • Include error details in rationale and metadata
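
To pull out errored samples for debugging, filter on the error details in the grade metadata. A sketch based on the shape shown above:

import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

errored = [
    r for r in results
    if "error" in (r["grade"].get("metadata") or {})
]
for r in errored:
    print(f"Sample {r['sample']['id']}: {r['grade']['rationale']}")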

Analyzing Results

Find Low Scores

import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

low_scores = [r for r in results if r["grade"]["score"] < 0.5]
print(f"Found {len(low_scores)} samples with score < 0.5")

for result in low_scores:
    print(f"Sample {result['sample']['id']}: {result['grade']['rationale']}")

Compare Metrics

import json

# Load summary
with open("results/summary.json") as f:
    summary = json.load(f)

metrics = summary["metrics"]["by_metric"]
for name, stats in metrics.items():
    print(f"{name}: {stats['avg_score_attempted']:.2f} avg, {stats['pass_rate']:.1f}% pass")

Extract Failures

# Find samples that failed the gate's per-sample criteria
# (assumes `results` was loaded from results.jsonl as in Find Low Scores)
def gate_passed(score, threshold=0.8):  # substitute your own gate logic
    return score >= threshold

failures = [r for r in results if not gate_passed(r["grade"]["score"])]
print(f"{len(failures)} samples failed the gate criteria")

Next Steps