Gates
Gates are the pass/fail criteria for your evaluation. They determine whether your agent meets the required performance threshold by checking aggregate metrics.
Quick overview:
- Single decision: One gate per suite determines pass/fail
- Two metrics: `avg_score` (average of all scores) or `accuracy` (percentage of samples passing a threshold)
- Flexible operators: `>=`, `>`, `<=`, `<`, `==` for threshold comparison
- Customizable pass criteria: Define what counts as “passing” for accuracy calculations
- Exit codes: Suite exits 0 for pass, 1 for fail

Common patterns:
- Average score must be 80%+: `avg_score >= 0.8`
- 90%+ of samples must pass: `accuracy >= 0.9`
- Custom threshold: Define per-sample pass criteria with `pass_value`
Basic Structure
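A minimal sketch of a gate definition, assuming it sits under a single top-level `gate` key in your suite YAML (one gate per suite; see the Suite YAML Reference for the exact schema):

```yaml
gate:
  metric_key: correctness   # which grader's scores to gate on (grader name is illustrative)
  metric: avg_score         # aggregate statistic: avg_score or accuracy
  op: gte                   # comparison operator, i.e. >=
  value: 0.8                # threshold: pass if avg_score >= 0.8
```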
Why Use Gates?
Gates provide automated pass/fail decisions for your evaluations, which is essential for:
- CI/CD Integration: Block deployments if agent performance drops
- Regression Testing: Set a baseline threshold and ensure new changes don’t degrade performance
- Quality Enforcement: Require that agents meet minimum standards before production
What Happens When Gates Fail?
When a gate condition is not met:
- The console output shows a failure message
- The exit code is 1 (non-zero indicates failure)
- The results JSON includes `gate_passed: false`
- All other data is preserved: you still get full results, scores, and trajectories even when gating fails
Common use case in CI:
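Since the suite’s exit code already encodes the result, a CI job can gate on it directly. A sketch of a GitHub Actions step, assuming a placeholder `run-eval` command (substitute your actual CLI invocation):

```yaml
# Hypothetical CI step: a non-zero exit code (gate failed) fails the job.
- name: Run evaluation suite
  run: run-eval suite.yaml   # placeholder command; exits 1 when the gate fails
```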
Required Fields
metric_key
Which grader to evaluate. Must match a key in your `graders` section:
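For example, with two graders, `metric_key` selects which one the gate reads (grader names and the `gate` layout are illustrative):

```yaml
graders:
  correctness:
    # ...grader config...
  style:
    # ...grader config...

gate:
  metric_key: correctness   # must match a key under graders
  metric: avg_score
  op: gte
  value: 0.8
```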
If you only have one grader, `metric_key` can be omitted - it will default to your single grader.
metric
Which aggregate statistic to compare. Two options:

avg_score
Average score across all samples (0.0 to 1.0).
Example: If scores are [0.8, 0.9, 0.6], avg_score ≈ 0.77.

accuracy
Pass rate across all samples (0.0 to 1.0).
By default, samples with score `>= 1.0` are considered “passing”. You can customize the per-sample threshold with `pass_op` and `pass_value` (see below).

Note: The default `metric` is `avg_score`, so you can omit it if that’s what you want:
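With a single grader, the shortest form can therefore omit both `metric_key` and `metric` (same assumed layout as above):

```yaml
gate:
  op: gte
  value: 0.8   # gates on avg_score (the default metric) of the only grader
```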
op
Comparison operator:
- `gte`: Greater than or equal (`>=`)
- `gt`: Greater than (`>`)
- `lte`: Less than or equal (`<=`)
- `lt`: Less than (`<`)
- `eq`: Equal (`==`)

Most common: `gte` (at least X)
value
Threshold value for comparison:
- For `avg_score`: 0.0 to 1.0
- For `accuracy`: 0.0 to 1.0 (representing a percentage, e.g., 0.9 = 90%)
Optional Fields
pass_op and pass_value
Customize when individual samples are considered “passing” (used for accuracy calculation):
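For example, to require that 90% of samples score at least 0.5 (same assumed layout as above; the threshold values are illustrative):

```yaml
gate:
  metric: accuracy
  op: gte
  value: 0.9       # 90% of samples must pass...
  pass_op: gte     # ...where a sample passes when its
  pass_value: 0.5  # score is >= 0.5 instead of >= 1.0
```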
Default behavior:
- If `metric` is `avg_score`: samples pass if score `>=` the gate value
- If `metric` is `accuracy`: samples pass if score `>= 1.0` (perfect)
Examples
Require 80% Average Score
Passes if the average score across all samples is >= 0.8
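A sketch, using the assumed `gate` layout from above:

```yaml
gate:
  metric: avg_score
  op: gte
  value: 0.8
```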
Require 90% Pass Rate (Perfect Scores)
Passes if 90% of samples have score = 1.0
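With the default per-sample pass rule (score `>= 1.0`), the sketch is:

```yaml
gate:
  metric: accuracy
  op: gte
  value: 0.9   # a sample passes only with score >= 1.0 (the default rule)
```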
Require 75% Pass Rate (Score >= 0.7)
Passes if 75% of samples have score >= 0.7
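Adding `pass_op` and `pass_value` relaxes the per-sample rule:

```yaml
gate:
  metric: accuracy
  op: gte
  value: 0.75
  pass_op: gte
  pass_value: 0.7   # a sample passes with score >= 0.7
```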
Maximum Error Rate
Allows up to 5% failures.
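Expressed as a pass-rate floor (at most 5% of samples may fail), still in the assumed layout:

```yaml
gate:
  metric: accuracy
  op: gte
  value: 0.95   # at least 95% must pass, i.e. at most 5% failures
```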
Exact Pass Rate
All samples must pass.
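Using `eq` to demand exactly a 100% pass rate:

```yaml
gate:
  metric: accuracy
  op: eq
  value: 1.0   # every sample must pass
```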
Multi-Metric Gating
When you have multiple graders, you can only gate on one metric:
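A sketch with two graders (names illustrative), gating only on one of them:

```yaml
graders:
  correctness:
    # ...grader config...
  helpfulness:
    # ...grader config...

gate:
  metric_key: correctness   # only this grader is gated
  metric: avg_score
  op: gte
  value: 0.8
```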
The evaluation passes/fails based on the gated metric, but results include scores for all metrics.
Understanding avg_score vs accuracy
avg_score
- Arithmetic mean of all scores
- Sensitive to partial credit
- Good for continuous evaluation
Example:
- Scores: [1.0, 0.8, 0.6]
- avg_score = (1.0 + 0.8 + 0.6) / 3 = 0.8
accuracy
- Percentage of samples meeting a threshold
- Binary pass/fail per sample
- Good for strict requirements
Example:
- Scores: [1.0, 0.8, 0.6]
- pass_value: 0.7
- Passing: [1.0, 0.8] = 2 out of 3
- accuracy = 2/3 = 0.667 (66.7%)
Errors and Attempted Samples
If a sample fails (errors during evaluation), it:
- Gets a score of 0.0
- Counts toward `total` but not `total_attempted`
- Is included in `avg_score_total` but not `avg_score_attempted`

You can gate on either:
- `avg_score_total`: includes errors as 0.0
- `avg_score_attempted`: excludes errors (only successfully attempted samples)

Note: The `metric` field currently only supports `avg_score` and `accuracy`. By default, gates use `avg_score_attempted`.
Gate Results
After evaluation, the console output reports whether the gate passed or failed.
The evaluation exit code reflects the gate result:
- 0: Passed
- 1: Failed
Advanced Gating
For complex gating logic (e.g., “pass if `accuracy >= 80%` OR `avg_score >= 0.9`”), you’ll need to:
- Run the evaluation with one gate
- Examine the results JSON
- Apply custom logic in a post-processing script
Next Steps
- Understanding Results - Interpreting evaluation output
- Multi-Metric Evaluation - Using multiple graders
- Suite YAML Reference - Complete gate configuration