# Understanding results
Understand evaluation results, metrics, and how to analyze agent performance data.
This guide explains how to interpret evaluation results.
## Result Structure

An evaluation produces three types of output:
- Console output: Real-time progress and summary
- Summary JSON: Aggregate metrics and configuration
- Results JSONL: Per-sample detailed results
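For programmatic analysis, the JSON outputs can be loaded directly once an evaluation has been saved to disk. A minimal sketch, assuming the `results/` layout produced by `--output results/` (described under Saving Results below):

```python
import json

# Aggregate metrics, suite config, and the gate result
with open("results/summary.json") as f:
    summary = json.load(f)

# Per-sample results: one JSON object per line
with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

print(f"Gate passed: {summary['gates_passed']}")
print(f"Samples loaded: {len(results)}")
```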
## Console Output

### Progress Display
```
Running evaluation: my-eval-suite
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 100%

Results:
  Total samples: 3
  Attempted: 3
  Avg score: 0.83 (attempted: 0.83)
  Passed: 2 (66.7%)

Gate (quality >= 0.75): PASSED
```

### Quiet Mode
```bash
letta-evals run suite.yaml --quiet
```

Output:

```
✓ PASSED
```

or

```
✗ FAILED
```

## JSON Output
### Saving Results

```bash
letta-evals run suite.yaml --output results/
```

Creates three files:
### header.json

Evaluation metadata:

```json
{
  "suite_name": "my-eval-suite",
  "timestamp": "2025-01-15T10:30:00Z",
  "version": "0.3.0"
}
```

### summary.json
Complete evaluation summary:

```json
{
  "suite": "my-eval-suite",
  "config": {
    "target": {...},
    "graders": {...},
    "gate": {...}
  },
  "metrics": {
    "total": 10,
    "total_attempted": 10,
    "avg_score_attempted": 0.85,
    "avg_score_total": 0.85,
    "passed_attempts": 8,
    "failed_attempts": 2,
    "by_metric": {
      "accuracy": {
        "avg_score_attempted": 0.90,
        "pass_rate": 90.0,
        "passed_attempts": 9,
        "failed_attempts": 1
      },
      "quality": {
        "avg_score_attempted": 0.80,
        "pass_rate": 70.0,
        "passed_attempts": 7,
        "failed_attempts": 3
      }
    },
    "cost": {
      "total_cost": 0.0234,
      "total_prompt_tokens": 15000,
      "total_completion_tokens": 3000
    }
  },
  "gates_passed": true
}
```

### results.jsonl
One JSON object per line, each representing one sample:

```json
{"sample": {"id": 0, "input": "What is 2+2?", "ground_truth": "4"}, "submission": "4", "grade": {"score": 1.0, "rationale": "Exact match: true"}, "trajectory": [...], "agent_id": "agent-123", "model_name": "default", "cost": 0.0012, "prompt_tokens": 500, "completion_tokens": 50}
{"sample": {"id": 1, "input": "What is 3+3?", "ground_truth": "6"}, "submission": "6", "grade": {"score": 1.0, "rationale": "Exact match: true"}, "trajectory": [...], "agent_id": "agent-124", "model_name": "default", "cost": 0.0011, "prompt_tokens": 480, "completion_tokens": 45}
```

## Metrics Explained
### total

Total number of samples in the evaluation (including errors).
### total_attempted

Number of samples that completed without errors.

If a sample fails during agent execution or grading, it's counted in `total` but not `total_attempted`.
### avg_score_attempted

Average score across samples that completed successfully.

Formula: `sum(scores) / total_attempted`

Range: 0.0 to 1.0
### avg_score_total

Average score across all samples, treating errors as 0.0.

Formula: `sum(scores) / total`

Range: 0.0 to 1.0
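To make the difference between the two averages concrete, here is a small sketch that recomputes both from a list of per-sample scores, where `None` stands in for an errored sample (purely illustrative data):

```python
# Illustrative per-sample scores; None marks a sample that errored
scores = [1.0, 0.5, None, 1.0]

total = len(scores)
attempted = [s for s in scores if s is not None]
total_attempted = len(attempted)

# avg_score_attempted: errored samples are excluded entirely
avg_score_attempted = sum(attempted) / total_attempted  # 0.833...

# avg_score_total: errored samples count as 0.0 but stay in the denominator
avg_score_total = sum(s if s is not None else 0.0 for s in scores) / total  # 0.625

print(total, total_attempted, avg_score_attempted, avg_score_total)
```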
### passed_attempts / failed_attempts

Number of samples that passed/failed the gate's per-sample criteria.

By default:

- If gate metric is `accuracy`: sample passes if score >= 1.0
- If gate metric is `avg_score`: sample passes if score >= gate value

Can be customized with `pass_op` and `pass_value` in gate config.
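As an illustration of the default `accuracy`-style criterion (a perfect score is required), this sketch recounts passes and failures from results.jsonl; swap the comparison for your own gate value if you use `avg_score` or a custom `pass_op`/`pass_value`:

```python
import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

# Default accuracy criterion: a sample passes only with a perfect score
passed_attempts = sum(1 for r in results if r["grade"]["score"] >= 1.0)
failed_attempts = len(results) - passed_attempts

print(f"passed: {passed_attempts}, failed: {failed_attempts}")
```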
### by_metric

For multi-metric evaluation, shows aggregate stats for each metric:

```json
"by_metric": {
  "accuracy": {
    "avg_score_attempted": 0.90,
    "avg_score_total": 0.85,
    "pass_rate": 90.0,
    "passed_attempts": 9,
    "failed_attempts": 1
  }
}
```

### cost (aggregate)
Cost and token usage metrics across all samples:

```json
"cost": {
  "total_cost": 0.0234,
  "total_prompt_tokens": 15000,
  "total_completion_tokens": 3000
}
```

Cost tracking is automatic for supported models including:
- OpenAI: GPT-4.1, GPT-4.1-mini, GPT-5, GPT-5-mini, GPT-5.1
- Anthropic: Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku 4.5
- Google: Gemini 3 Pro
- DeepSeek, Kimi, and more
Returns `null` if model pricing is not available.
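Because each line of results.jsonl also carries per-sample `cost`, `prompt_tokens`, and `completion_tokens`, the aggregate numbers can be cross-checked by hand. A sketch that skips samples without pricing data:

```python
import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

# Sum per-sample usage; cost is null when no pricing data is available
total_cost = sum(r["cost"] for r in results if r.get("cost") is not None)
total_prompt_tokens = sum(r.get("prompt_tokens", 0) for r in results)
total_completion_tokens = sum(r.get("completion_tokens", 0) for r in results)

print(f"${total_cost:.4f} total, "
      f"{total_prompt_tokens + total_completion_tokens} tokens")
```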
## Sample Results

Each sample result includes:
### sample

The original dataset sample:

```json
"sample": {
  "id": 0,
  "input": "What is 2+2?",
  "ground_truth": "4",
  "metadata": {...}
}
```

### submission
The extracted text that was graded:

```json
"submission": "The answer is 4"
```

### grade

The grading result:
"grade": { "score": 1.0, "rationale": "Contains ground_truth: true", "metadata": {"model": "gpt-4o-mini", "usage": {...}}}grades (multi-metric)
Section titled “grades (multi-metric)”For multi-metric evaluation:
"grades": { "accuracy": {"score": 1.0, "rationale": "Exact match"}, "quality": {"score": 0.85, "rationale": "Good but verbose"}}trajectory
Section titled “trajectory”The complete conversation history:
"trajectory": [ [ {"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "The answer is 4"} ]]agent_id
### agent_id

The ID of the agent that generated this response:

```json
"agent_id": "agent-abc-123"
```

### model_name
The model configuration used:

```json
"model_name": "gpt-4o-mini"
```

### agent_usage
Token usage statistics (if available):

```json
"agent_usage": [
  {"completion_tokens": 10, "prompt_tokens": 50, "total_tokens": 60}
]
```

### cost

Cost in dollars for this sample (if model pricing is available):
"cost": 0.00234prompt_tokens / completion_tokens
Section titled “prompt_tokens / completion_tokens”Token counts for this sample:
"prompt_tokens": 1500,"completion_tokens": 300Interpreting Scores
### Score Ranges
Section titled “Score Ranges”- 1.0: Perfect - fully meets criteria
- 0.8-0.99: Very good - minor issues
- 0.6-0.79: Good - notable improvements possible
- 0.4-0.59: Acceptable - significant issues
- 0.2-0.39: Poor - major problems
- 0.0-0.19: Failed - did not meet criteria
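To see how a run distributes across these bands, you can bucket the per-sample scores from results.jsonl; a small sketch using the ranges above:

```python
import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

# (lower bound, label) pairs matching the bands above, checked top-down
bands = [(1.0, "perfect"), (0.8, "very good"), (0.6, "good"),
         (0.4, "acceptable"), (0.2, "poor"), (0.0, "failed")]

counts = {label: 0 for _, label in bands}
for r in results:
    score = r["grade"]["score"]
    for lower, label in bands:
        if score >= lower:
            counts[label] += 1
            break

for label, count in counts.items():
    print(f"{label}: {count}")
```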
### Binary vs Continuous

Tool graders typically return binary scores:
- 1.0: Passed
- 0.0: Failed
Rubric graders return continuous scores:
- Any value from 0.0 to 1.0
- Allows for partial credit
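A practical consequence: with rubric graders it is worth checking how often partial credit is actually awarded. A quick sketch that splits scores into perfect, partial, and zero:

```python
import json

with open("results/results.jsonl") as f:
    scores = [json.loads(line)["grade"]["score"] for line in f]

# Partial credit (0 < score < 1) only appears with continuous graders
perfect = sum(1 for s in scores if s >= 1.0)
partial = sum(1 for s in scores if 0.0 < s < 1.0)
zero = sum(1 for s in scores if s <= 0.0)

print(f"perfect: {perfect}, partial: {partial}, zero: {zero}")
```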
## Multi-Model Results

When testing multiple models:

```json
"metrics": {
  "per_model": [
    {
      "model_name": "gpt-4o-mini",
      "avg_score_attempted": 0.85,
      "passed_samples": 8,
      "failed_samples": 2,
      "cost": {
        "total_cost": 0.0089,
        "total_prompt_tokens": 8000,
        "total_completion_tokens": 1500
      }
    },
    {
      "model_name": "claude-3-5-sonnet",
      "avg_score_attempted": 0.90,
      "passed_samples": 9,
      "failed_samples": 1,
      "cost": {
        "total_cost": 0.0145,
        "total_prompt_tokens": 7500,
        "total_completion_tokens": 1400
      }
    }
  ]
}
```

Console output:
```
Results by model:
  gpt-4o-mini        - Avg: 0.85, Pass: 80.0%
  claude-3-5-sonnet  - Avg: 0.90, Pass: 90.0%
```
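To compare models programmatically rather than reading the console table, read `per_model` from summary.json; a sketch assuming the structure shown above:

```python
import json

with open("results/summary.json") as f:
    summary = json.load(f)

# Rank models by average score over attempted samples
per_model = summary["metrics"]["per_model"]
for entry in sorted(per_model, key=lambda m: m["avg_score_attempted"], reverse=True):
    print(f"{entry['model_name']}: avg {entry['avg_score_attempted']:.2f}, "
          f"cost ${entry['cost']['total_cost']:.4f}")
```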
## Multiple Runs Statistics

Run evaluations multiple times to measure consistency and get aggregate statistics.
### Configuration

Specify in YAML:

```yaml
name: my-eval-suite
dataset: dataset.jsonl
num_runs: 5  # Run 5 times
target:
  kind: agent
  agent_file: my_agent.af
graders:
  accuracy:
    kind: tool
    function: exact_match
gate:
  metric_key: accuracy
  op: gte
  value: 0.8
```

Or via CLI:
```bash
letta-evals run suite.yaml --num-runs 10 --output results/
```

### Output Structure
Section titled “Output Structure”results/├── run_1/│ ├── header.json│ ├── results.jsonl│ └── summary.json├── run_2/│ ├── header.json│ ├── results.jsonl│ └── summary.json├── ...└── aggregate_stats.json # Statistics across all runsAggregate Statistics File
### Aggregate Statistics File

The `aggregate_stats.json` file includes statistics across all runs:

```json
{
  "num_runs": 10,
  "runs_passed": 8,
  "mean_avg_score_attempted": 0.847,
  "std_avg_score_attempted": 0.042,
  "mean_avg_score_total": 0.847,
  "std_avg_score_total": 0.042,
  "mean_scores": {
    "accuracy": 0.89,
    "quality": 0.82
  },
  "std_scores": {
    "accuracy": 0.035,
    "quality": 0.051
  },
  "individual_run_metrics": [
    {
      "avg_score_attempted": 0.85,
      "avg_score_total": 0.85,
      "pass_rate": 0.85,
      "by_metric": {
        "accuracy": {
          "avg_score_attempted": 0.9,
          "avg_score_total": 0.9,
          "pass_rate": 0.9
        }
      }
    }
    // ... metrics from runs 2-10
  ]
}
```

Key fields:

- `num_runs`: Total number of runs executed
- `runs_passed`: Number of runs that passed the gate
- `mean_avg_score_attempted`: Mean score across runs (only attempted samples)
- `std_avg_score_attempted`: Standard deviation (measures consistency)
- `mean_scores`: Mean for each metric (e.g., `{"accuracy": 0.89}`)
- `std_scores`: Standard deviation for each metric (e.g., `{"accuracy": 0.035}`)
- `individual_run_metrics`: Full metrics object from each individual run
### Use Cases

Measure consistency of non-deterministic agents:

```bash
letta-evals run suite.yaml --num-runs 20 --output results/
# Check std_avg_score_attempted in aggregate_stats.json
# Low std = consistent, high std = variable
```

Get confidence intervals:
```python
import json
import math

with open("results/aggregate_stats.json") as f:
    stats = json.load(f)

mean = stats["mean_avg_score_attempted"]
std = stats["std_avg_score_attempted"]
n = stats["num_runs"]

# 95% confidence interval (assuming normal distribution)
margin = 1.96 * (std / math.sqrt(n))
print(f"Score: {mean:.3f} ± {margin:.3f}")
```

Compare metric consistency:
```python
import json

with open("results/aggregate_stats.json") as f:
    stats = json.load(f)

for metric_name, mean in stats["mean_scores"].items():
    std = stats["std_scores"][metric_name]
    consistency = "consistent" if std < 0.05 else "variable"
    print(f"{metric_name}: {mean:.3f} ± {std:.3f} ({consistency})")
```

## Error Handling
If a sample encounters an error:

```json
{
  "sample": {...},
  "submission": "",
  "grade": {
    "score": 0.0,
    "rationale": "Error during grading: Connection timeout",
    "metadata": {"error": "timeout", "error_type": "ConnectionError"}
  }
}
```

Errors:

- Count toward `total` but not `total_attempted`
- Get a score of 0.0
- Include error details in the rationale and metadata
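To pull out just the errored samples for debugging, filter on the grade metadata shown above; a sketch assuming errors are flagged with an `error` key as in the example:

```python
import json

with open("results/results.jsonl") as f:
    results = [json.loads(line) for line in f]

# Errored samples carry error details in the grade metadata (see example above)
errored = [r for r in results if "error" in r["grade"].get("metadata", {})]

for r in errored:
    print(f"Sample {r['sample']['id']}: {r['grade']['rationale']}")
```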
## Analyzing Results

### Find Low Scores
Section titled “Find Low Scores”import json
with open("results/results.jsonl") as f: results = [json.loads(line) for line in f]
low_scores = [r for r in results if r["grade"]["score"] < 0.5]print(f"Found {len(low_scores)} samples with score < 0.5")
for result in low_scores: print(f"Sample {result['sample']['id']}: {result['grade']['rationale']}")Compare Metrics
Section titled “Compare Metrics”# Load summarywith open("results/summary.json") as f: summary = json.load(f)
metrics = summary["metrics"]["by_metric"]for name, stats in metrics.items(): print(f"{name}: {stats['avg_score_attempted']:.2f} avg, {stats['pass_rate']:.1f}% pass")Extract Failures
Section titled “Extract Failures”# Find samples that failed gate criteriafailures = [ r for r in results if not gate_passed(r["grade"]["score"]) # Your gate logic]Next Steps
Section titled “Next Steps”- Gates - Setting pass/fail criteria
- CLI Commands - Running evaluations