Understanding Results

This guide explains how to interpret evaluation results.

Result Structure

An evaluation produces three types of output:

  1. Console output: Real-time progress and summary
  2. Summary JSON: Aggregate metrics and configuration
  3. Results JSONL: Per-sample detailed results

Console Output

Progress Display

Running evaluation: my-eval-suite
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 100%
Results:
Total samples: 3
Attempted: 3
Avg score: 0.83 (attempted: 0.83)
Passed: 2 (66.7%)
Gate (quality >= 0.75): PASSED

Quiet Mode

$ letta-evals run suite.yaml --quiet

Output:

✓ PASSED

or

✗ FAILED
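
In scripts or CI, you can act on the quiet-mode result programmatically. The sketch below shells out to the CLI and inspects its output; it assumes (not verified here) that the process also exits non-zero when the gate fails, so check the behavior of your installed version.

# A minimal CI sketch: run the suite in quiet mode and fail the job
# when the gate does not pass. Assumes the output shown above; the
# exit-code convention is an assumption, not documented behavior.
import subprocess
import sys

proc = subprocess.run(
    ["letta-evals", "run", "suite.yaml", "--quiet"],
    capture_output=True,
    text=True,
)
print(proc.stdout.strip())
if proc.returncode != 0 or "FAILED" in proc.stdout:
    sys.exit(1)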

JSON Output

Saving Results

$ letta-evals run suite.yaml --output results/

Creates three files:

header.json

Evaluation metadata:

{
  "suite_name": "my-eval-suite",
  "timestamp": "2025-01-15T10:30:00Z",
  "version": "0.3.0"
}

summary.json

Complete evaluation summary:

{
  "suite": "my-eval-suite",
  "config": {
    "target": {...},
    "graders": {...},
    "gate": {...}
  },
  "metrics": {
    "total": 10,
    "total_attempted": 10,
    "avg_score_attempted": 0.85,
    "avg_score_total": 0.85,
    "passed_attempts": 8,
    "failed_attempts": 2,
    "by_metric": {
      "accuracy": {
        "avg_score_attempted": 0.90,
        "pass_rate": 90.0,
        "passed_attempts": 9,
        "failed_attempts": 1
      },
      "quality": {
        "avg_score_attempted": 0.80,
        "pass_rate": 70.0,
        "passed_attempts": 7,
        "failed_attempts": 3
      }
    }
  },
  "gates_passed": true
}
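
The top-level gates_passed flag is convenient for quick programmatic checks. A small sketch, assuming the summary was saved under results/ as shown above:

import json

with open("results/summary.json") as f:
    summary = json.load(f)

print("Gate passed" if summary["gates_passed"] else "Gate failed")
print(f"Average score: {summary['metrics']['avg_score_attempted']:.2f}")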

results.jsonl

One JSON object per line, each representing one sample:

1{"sample": {"id": 0, "input": "What is 2+2?", "ground_truth": "4"}, "submission": "4", "grade": {"score": 1.0, "rationale": "Exact match: true"}, "trajectory": [...], "agent_id": "agent-123", "model_name": "default"}
2{"sample": {"id": 1, "input": "What is 3+3?", "ground_truth": "6"}, "submission": "6", "grade": {"score": 1.0, "rationale": "Exact match: true"}, "trajectory": [...], "agent_id": "agent-124", "model_name": "default"}

Metrics Explained

total

Total number of samples in the evaluation (including errors).

total_attempted

Number of samples that completed without errors.

If a sample fails during agent execution or grading, it’s counted in total but not total_attempted.

avg_score_attempted

Average score across samples that completed successfully.

Formula: sum(scores) / total_attempted

Range: 0.0 to 1.0

avg_score_total

Average score across all samples, treating errors as 0.0.

Formula: sum(scores) / total

Range: 0.0 to 1.0
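
Both averages can be recomputed from results.jsonl. The sketch below assumes errored samples are written with a score of 0.0 and an "error" key in the grade metadata, as described under Error Handling.

import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

scores = [r["grade"]["score"] for r in results]
attempted = [
    r["grade"]["score"]
    for r in results
    if "error" not in (r["grade"].get("metadata") or {})
]

avg_score_total = sum(scores) / len(scores)
avg_score_attempted = sum(attempted) / len(attempted) if attempted else 0.0
print(f"avg_score_total: {avg_score_total:.2f}")
print(f"avg_score_attempted: {avg_score_attempted:.2f}")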

passed_attempts / failed_attempts

Number of samples that passed/failed the gate’s per-sample criteria.

By default:

  • If gate metric is accuracy: sample passes if score >= 1.0
  • If gate metric is avg_score: sample passes if score >= gate value

Can be customized with pass_op and pass_value in gate config.
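
As an illustration of the defaults above, the per-sample check behaves roughly like this. This is a sketch of the described behavior, not the library's implementation; the pass_op/pass_value handling in particular is an assumption.

def sample_passed(score, gate_metric="accuracy", gate_value=0.8,
                  pass_op=None, pass_value=None):
    """Rough model of the per-sample pass criteria described above."""
    if pass_op is not None and pass_value is not None:
        ops = {
            "gte": lambda s, v: s >= v,
            "gt": lambda s, v: s > v,
            "lte": lambda s, v: s <= v,
            "lt": lambda s, v: s < v,
        }
        return ops[pass_op](score, pass_value)
    if gate_metric == "accuracy":
        return score >= 1.0        # default for accuracy gates
    return score >= gate_value     # default for avg_score gates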

by_metric

For multi-metric evaluation, shows aggregate stats for each metric:

1"by_metric": {
2 "accuracy": {
3 "avg_score_attempted": 0.90,
4 "avg_score_total": 0.85,
5 "pass_rate": 90.0,
6 "passed_attempts": 9,
7 "failed_attempts": 1
8 }
9}

Sample Results

Each sample result includes:

sample

The original dataset sample:

1"sample": {
2 "id": 0,
3 "input": "What is 2+2?",
4 "ground_truth": "4",
5 "metadata": {...}
6}

submission

The extracted text that was graded:

1"submission": "The answer is 4"

grade

The grading result:

1"grade": {
2 "score": 1.0,
3 "rationale": "Contains ground_truth: true",
4 "metadata": {"model": "gpt-4o-mini", "usage": {...}}
5}

grades (multi-metric)

For multi-metric evaluation:

1"grades": {
2 "accuracy": {"score": 1.0, "rationale": "Exact match"},
3 "quality": {"score": 0.85, "rationale": "Good but verbose"}
4}

trajectory

The complete conversation history:

1"trajectory": [
2 [
3 {"role": "user", "content": "What is 2+2?"},
4 {"role": "assistant", "content": "The answer is 4"}
5 ]
6]

agent_id

The ID of the agent that generated this response:

1"agent_id": "agent-abc-123"

model_name

The model configuration used:

1"model_name": "gpt-4o-mini"

agent_usage

Token usage statistics (if available):

1"agent_usage": [
2 {"completion_tokens": 10, "prompt_tokens": 50, "total_tokens": 60}
3]
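
To get a rough sense of cost, you can total token usage across samples. A sketch, assuming each agent_usage entry has the shape shown above (the field may be absent for some samples):

import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

total_tokens = sum(
    usage.get("total_tokens", 0)
    for r in results
    for usage in r.get("agent_usage") or []
)
print(f"Total tokens across all samples: {total_tokens}")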

Interpreting Scores

Score Ranges

  • 1.0: Perfect - fully meets criteria
  • 0.8-0.99: Very good - minor issues
  • 0.6-0.79: Good - notable improvements possible
  • 0.4-0.59: Acceptable - significant issues
  • 0.2-0.39: Poor - major problems
  • 0.0-0.19: Failed - did not meet criteria
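
These ranges can also be used to bucket per-sample results into a quick quality distribution. A small sketch over results.jsonl:

import json
from collections import Counter

def bucket(score):
    # Buckets mirror the score ranges listed above
    if score >= 1.0:
        return "perfect"
    if score >= 0.8:
        return "very good"
    if score >= 0.6:
        return "good"
    if score >= 0.4:
        return "acceptable"
    if score >= 0.2:
        return "poor"
    return "failed"

with open("results/results.jsonl") as f:
    counts = Counter(bucket(json.loads(line)["grade"]["score"]) for line in f)

for name, count in counts.most_common():
    print(f"{name}: {count}")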

Binary vs Continuous

Tool graders typically return binary scores:

  • 1.0: Passed
  • 0.0: Failed

Rubric graders return continuous scores:

  • Any value from 0.0 to 1.0
  • Allows for partial credit

Multi-Model Results

When testing multiple models:

1"metrics": {
2 "per_model": [
3 {
4 "model_name": "gpt-4o-mini",
5 "avg_score_attempted": 0.85,
6 "passed_samples": 8,
7 "failed_samples": 2
8 },
9 {
10 "model_name": "claude-3-5-sonnet",
11 "avg_score_attempted": 0.90,
12 "passed_samples": 9,
13 "failed_samples": 1
14 }
15 ]
16}

Console output:

Results by model:
gpt-4o-mini - Avg: 0.85, Pass: 80.0%
claude-3-5-sonnet - Avg: 0.90, Pass: 90.0%
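
The per_model block also makes it easy to compare models programmatically. A sketch, assuming the summary layout shown above:

import json

with open("results/summary.json") as f:
    summary = json.load(f)

per_model = summary["metrics"]["per_model"]
for entry in sorted(per_model, key=lambda m: m["avg_score_attempted"], reverse=True):
    total = entry["passed_samples"] + entry["failed_samples"]
    pass_rate = 100.0 * entry["passed_samples"] / total if total else 0.0
    print(f"{entry['model_name']}: avg {entry['avg_score_attempted']:.2f}, "
          f"pass {pass_rate:.1f}%")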

Multiple Runs Statistics

Run evaluations multiple times to measure consistency and get aggregate statistics.

Configuration

Specify in YAML:

name: my-eval-suite
dataset: dataset.jsonl
num_runs: 5  # Run 5 times
target:
  kind: agent
  agent_file: my_agent.af
graders:
  accuracy:
    kind: tool
    function: exact_match
gate:
  metric_key: accuracy
  op: gte
  value: 0.8

Or via CLI:

$ letta-evals run suite.yaml --num-runs 10 --output results/

Output Structure

results/
├── run_1/
│   ├── header.json
│   ├── results.jsonl
│   └── summary.json
├── run_2/
│   ├── header.json
│   ├── results.jsonl
│   └── summary.json
├── ...
└── aggregate_stats.json   # Statistics across all runs

Aggregate Statistics File

The aggregate_stats.json includes statistics across all runs:

{
  "num_runs": 10,
  "runs_passed": 8,
  "mean_avg_score_attempted": 0.847,
  "std_avg_score_attempted": 0.042,
  "mean_avg_score_total": 0.847,
  "std_avg_score_total": 0.042,
  "mean_scores": {
    "accuracy": 0.89,
    "quality": 0.82
  },
  "std_scores": {
    "accuracy": 0.035,
    "quality": 0.051
  },
  "individual_run_metrics": [
    {
      "avg_score_attempted": 0.85,
      "avg_score_total": 0.85,
      "pass_rate": 0.85,
      "by_metric": {
        "accuracy": {
          "avg_score_attempted": 0.90,
          "avg_score_total": 0.90,
          "pass_rate": 0.90
        }
      }
    }
    // ... metrics from runs 2-10
  ]
}

Key fields:

  • num_runs: Total number of runs executed
  • runs_passed: Number of runs that passed the gate
  • mean_avg_score_attempted: Mean score across runs (only attempted samples)
  • std_avg_score_attempted: Standard deviation (measures consistency)
  • mean_scores: Mean for each metric (e.g., {"accuracy": 0.89})
  • std_scores: Standard deviation for each metric (e.g., {"accuracy": 0.035})
  • individual_run_metrics: Full metrics object from each individual run

Use Cases

Measure consistency of non-deterministic agents:

$ letta-evals run suite.yaml --num-runs 20 --output results/
# Check std_avg_score_attempted in aggregate_stats.json
# Low std = consistent, high std = variable

Get confidence intervals:

import json
import math

with open("results/aggregate_stats.json") as f:
    stats = json.load(f)

mean = stats["mean_avg_score_attempted"]
std = stats["std_avg_score_attempted"]
n = stats["num_runs"]

# 95% confidence interval (assuming normal distribution)
margin = 1.96 * (std / math.sqrt(n))
print(f"Score: {mean:.3f} ± {margin:.3f}")

Compare metric consistency:

import json

with open("results/aggregate_stats.json") as f:
    stats = json.load(f)

for metric_name, mean in stats["mean_scores"].items():
    std = stats["std_scores"][metric_name]
    consistency = "consistent" if std < 0.05 else "variable"
    print(f"{metric_name}: {mean:.3f} ± {std:.3f} ({consistency})")

Error Handling

If a sample encounters an error:

{
  "sample": {...},
  "submission": "",
  "grade": {
    "score": 0.0,
    "rationale": "Error during grading: Connection timeout",
    "metadata": {"error": "timeout", "error_type": "ConnectionError"}
  }
}

Errors:

  • Count toward total but not total_attempted
  • Get score of 0.0
  • Include error details in rationale and metadata
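
To pull out errored samples for debugging, filter on the error details in the grade metadata. A sketch based on the shape shown above:

import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

errored = [
    r for r in results
    if "error" in (r["grade"].get("metadata") or {})
]
for r in errored:
    print(f"Sample {r['sample']['id']}: {r['grade']['rationale']}")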

Analyzing Results

Find Low Scores

import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

low_scores = [r for r in results if r["grade"]["score"] < 0.5]
print(f"Found {len(low_scores)} samples with score < 0.5")

for result in low_scores:
    print(f"Sample {result['sample']['id']}: {result['grade']['rationale']}")

Compare Metrics

import json

# Load summary
with open("results/summary.json") as f:
    summary = json.load(f)

metrics = summary["metrics"]["by_metric"]
for name, stats in metrics.items():
    print(f"{name}: {stats['avg_score_attempted']:.2f} avg, {stats['pass_rate']:.1f}% pass")

Extract Failures

# Find samples that failed the gate's per-sample criteria
# (assumes `results` was loaded from results.jsonl as in Find Low Scores)
def gate_passed(score, threshold=0.8):  # substitute your own gate logic
    return score >= threshold

failures = [r for r in results if not gate_passed(r["grade"]["score"])]
print(f"{len(failures)} samples failed the gate criteria")

Next Steps