Understanding Results
This guide explains how to interpret evaluation results.
Result Structure
An evaluation produces three types of output:
- Console output: Real-time progress and summary
- Summary JSON: Aggregate metrics and configuration
- Results JSONL: Per-sample detailed results
Console Output
Progress Display
Quiet Mode
Output:
or
JSON Output
Saving Results
Creates three files:
header.json
Evaluation metadata:
summary.json
Complete evaluation summary:
results.jsonl
One JSON object per line, each representing one sample; the individual fields are described under Sample Results below.
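A minimal sketch of loading all three files in Python (the output directory path below is hypothetical; point it at wherever your evaluation wrote its results):

```python
import json
from pathlib import Path

# Hypothetical output directory; substitute your evaluation's results folder.
results_dir = Path("results/my-eval-run")

# header.json and summary.json are single JSON documents.
header = json.loads((results_dir / "header.json").read_text())
summary = json.loads((results_dir / "summary.json").read_text())

# results.jsonl holds one JSON object per line, one per sample.
samples = [
    json.loads(line)
    for line in (results_dir / "results.jsonl").read_text().splitlines()
    if line.strip()
]

print(f"Loaded {len(samples)} sample results")
```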
Metrics Explained
total
Total number of samples in the evaluation, including samples that errored.
total_attempted
Number of samples that completed without errors.
If a sample fails during agent execution or grading, it’s counted in total but not total_attempted.
avg_score_attempted
Average score across samples that completed successfully.
Formula: sum(scores) / total_attempted
Range: 0.0 to 1.0
avg_score_total
Average score across all samples, treating errors as 0.0.
Formula: sum(scores) / total
Range: 0.0 to 1.0
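To make the two formulas concrete, here is a small worked sketch (using None to mark errored samples is only an illustration device, not the stored format):

```python
# Per-sample scores; None marks a sample that errored during execution or grading.
scores = [1.0, 0.5, None, 0.8, None, 1.0]

attempted = [s for s in scores if s is not None]

total = len(scores)               # 6  (errors included)
total_attempted = len(attempted)  # 4  (errors excluded)

avg_score_attempted = sum(attempted) / total_attempted  # 3.3 / 4 = 0.825
avg_score_total = sum(attempted) / total                # 3.3 / 6 = 0.55 (errors count as 0.0)
```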
passed_attempts / failed_attempts
Number of samples that passed/failed the gate’s per-sample criteria.
By default:
- If gate metric is accuracy: sample passes if score >= 1.0
- If gate metric is avg_score: sample passes if score >= gate value
This can be customized with pass_op and pass_value in the gate config.
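A rough sketch of this per-sample pass check (the set of pass_op operator strings shown here is an assumption; consult the gate configuration reference for the exact values it accepts):

```python
import operator

# Assumed operator strings; your gate config may use different names.
OPS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt, "==": operator.eq}

def sample_passes(score, gate_metric, gate_value, pass_op=None, pass_value=None):
    """Return True if a single sample meets the gate's per-sample criteria."""
    if pass_op is not None and pass_value is not None:
        # Explicit pass_op / pass_value override the defaults.
        return OPS[pass_op](score, pass_value)
    if gate_metric == "accuracy":
        return score >= 1.0        # default for accuracy
    return score >= gate_value     # default for avg_score

print(sample_passes(0.8, "avg_score", gate_value=0.7))                               # True
print(sample_passes(0.8, "accuracy", gate_value=0.7))                                # False
print(sample_passes(0.8, "accuracy", gate_value=0.7, pass_op=">=", pass_value=0.5))  # True
```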
by_metric
For multi-metric evaluation, shows aggregate stats for each metric:
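The shape is roughly a mapping from metric name to its own aggregate statistics (a sketch; values are illustrative and the exact per-metric keys may differ):

```python
# Illustrative shape only; exact per-metric keys may differ.
by_metric = {
    "accuracy":    {"avg_score_attempted": 0.90, "avg_score_total": 0.85},
    "helpfulness": {"avg_score_attempted": 0.72, "avg_score_total": 0.68},
}
```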
Sample Results
Each sample result includes:
sample
The original dataset sample:
submission
The extracted text that was graded:
grade
The grading result:
grades (multi-metric)
For multi-metric evaluation:
trajectory
The complete conversation history:
agent_id
The ID of the agent that generated this response:
model_name
The model configuration used:
agent_usage
Token usage statistics (if available):
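Putting these fields together, a single line of results.jsonl has roughly the following shape (the values and the exact nesting below are illustrative assumptions, not the precise schema):

```python
# Illustrative only: values and nested layouts are assumptions.
sample_result = {
    "sample": {"input": "What is 2 + 2?", "ideal": "4"},      # original dataset sample
    "submission": "The answer is 4.",                          # extracted text that was graded
    "grade": {"score": 1.0, "rationale": "Matches the ideal answer."},
    "grades": {                                                # multi-metric evaluations only
        "accuracy": {"score": 1.0},
        "helpfulness": {"score": 0.8},
    },
    "trajectory": [                                            # complete conversation history
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "The answer is 4."},
    ],
    "agent_id": "my-agent",
    "model_name": "my-model-config",
    "agent_usage": {"input_tokens": 12, "output_tokens": 6},   # token usage, if available
}
```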
Interpreting Scores
Score Ranges
- 1.0: Perfect - fully meets criteria
- 0.8-0.99: Very good - minor issues
- 0.6-0.79: Good - notable improvements possible
- 0.4-0.59: Acceptable - significant issues
- 0.2-0.39: Poor - major problems
- 0.0-0.19: Failed - did not meet criteria
Binary vs Continuous
Tool graders typically return binary scores:
- 1.0: Passed
- 0.0: Failed
Rubric graders return continuous scores:
- Any value from 0.0 to 1.0
- Allows for partial credit
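As a small illustration of the difference, the average of binary tool-grader scores is just a pass rate, while continuous rubric scores carry partial credit into the average:

```python
tool_scores = [1.0, 0.0, 1.0, 1.0]      # binary: each sample either passed or failed
rubric_scores = [1.0, 0.4, 0.9, 0.6]    # continuous: partial credit per sample

pass_rate = sum(tool_scores) / len(tool_scores)        # 0.75 (same as fraction passed)
avg_rubric = sum(rubric_scores) / len(rubric_scores)   # 0.725 (reflects partial credit)
```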
Multi-Model Results
When testing multiple models:
Console output:
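A per-model breakdown can also be computed from results.jsonl by grouping on the model_name field (a sketch; the grade score nesting is an assumption, so adjust it to your schema):

```python
import json
from collections import defaultdict

scores_by_model = defaultdict(list)
with open("results.jsonl") as f:
    for line in f:
        r = json.loads(line)
        scores_by_model[r["model_name"]].append(r["grade"]["score"])

for model, scores in sorted(scores_by_model.items()):
    print(f"{model}: {sum(scores) / len(scores):.3f} over {len(scores)} samples")
```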
Multiple Runs Statistics
Run evaluations multiple times to measure consistency and get aggregate statistics.
Configuration
Specify in YAML:
Or via CLI:
Output Structure
Aggregate Statistics File
The aggregate_stats.json includes statistics across all runs:
Key fields:
- num_runs: Total number of runs executed
- runs_passed: Number of runs that passed the gate
- mean_avg_score_attempted: Mean score across runs (only attempted samples)
- std_avg_score_attempted: Standard deviation (measures consistency)
- mean_scores: Mean for each metric (e.g., {"accuracy": 0.89})
- std_scores: Standard deviation for each metric (e.g., {"accuracy": 0.035})
- individual_run_metrics: Full metrics object from each individual run
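A sketch of reading these fields and reporting run-to-run consistency (the filename is as listed above; the print formatting is just an example):

```python
import json

with open("aggregate_stats.json") as f:
    stats = json.load(f)

print(f"{stats['runs_passed']}/{stats['num_runs']} runs passed the gate")

# A small standard deviation relative to the mean indicates consistent runs.
mean = stats["mean_avg_score_attempted"]
std = stats["std_avg_score_attempted"]
print(f"avg_score_attempted: {mean:.3f} ± {std:.3f} across {stats['num_runs']} runs")
```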
Use Cases
Measure consistency of non-deterministic agents:
Get confidence intervals:
Compare metric consistency:
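For example, an approximate 95% confidence interval for a metric can be derived from its mean, its standard deviation, and the number of runs; comparing interval widths (or std_scores directly) across metrics shows which ones are most consistent. A worked sketch using the example values above, with num_runs = 5 as an assumed run count:

```python
import math

mean = 0.89      # e.g. mean_scores["accuracy"]
std = 0.035      # e.g. std_scores["accuracy"]
num_runs = 5     # assumed for illustration

# Standard error of the mean, then a ~95% interval (1.96 standard errors),
# assuming roughly normal run-to-run variation.
sem = std / math.sqrt(num_runs)
low, high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"accuracy: {mean:.3f} (95% CI {low:.3f} to {high:.3f})")  # ~0.859 to 0.921
```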
Error Handling
If a sample encounters an error:
Errors:
- Count toward total but not total_attempted
- Get a score of 0.0
- Include error details in rationale and metadata
Analyzing Results
Find Low Scores
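A sketch that flags low-scoring samples from results.jsonl (the grade score nesting is an assumption; adjust it to your schema):

```python
import json

THRESHOLD = 0.5  # flag samples scoring below this

with open("results.jsonl") as f:
    for line in f:
        r = json.loads(line)
        score = r["grade"]["score"]
        if score < THRESHOLD:
            # Print the score and a preview of the graded submission.
            print(f"{score:.2f}  {r['submission'][:80]!r}")
```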
Compare Metrics
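For multi-metric runs, a sketch that averages each metric across samples using the grades field (field nesting is again an assumption):

```python
import json
from collections import defaultdict

per_metric = defaultdict(list)
with open("results.jsonl") as f:
    for line in f:
        r = json.loads(line)
        for metric, grade in r.get("grades", {}).items():
            per_metric[metric].append(grade["score"])

for metric, scores in sorted(per_metric.items()):
    print(f"{metric}: {sum(scores) / len(scores):.3f} over {len(scores)} samples")
```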
Extract Failures
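A sketch that collects failing samples (score of 0.0, which also covers errored samples) into a separate file for closer review; field nesting is an assumption:

```python
import json

failures = []
with open("results.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if r["grade"]["score"] == 0.0:
            failures.append(r)

# Write failures to a separate JSONL file for manual inspection.
with open("failures.jsonl", "w") as f:
    for r in failures:
        f.write(json.dumps(r) + "\n")

print(f"{len(failures)} failing samples written to failures.jsonl")
```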
Next Steps
- Gates - Setting pass/fail criteria
- CLI Commands - Running evaluations