Tool Graders
Tool graders use Python functions to programmatically evaluate submissions. They’re ideal for deterministic, rule-based evaluation.
Overview
Tool graders:
- Execute Python functions that take (sample, submission) and return a GradeResult
- Are fast and deterministic
- Don’t require external API calls
- Can implement any custom logic
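As a rough illustration of the (sample, submission) -> GradeResult shape described above, here is a minimal sketch. The Sample and GradeResult classes below are hypothetical stand-ins defined only for this example, and their field names (input, ground_truth, score, rationale) are assumptions. The library's real types may differ.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins for the library's types; the real field names may differ.
@dataclass
class Sample:
    input: str
    ground_truth: Optional[str] = None


@dataclass
class GradeResult:
    score: float
    rationale: str = ""


def word_count_under_limit(sample: Sample, submission: str) -> GradeResult:
    """Illustrative grader: pass if the submission has at most 50 words."""
    limit = 50  # arbitrary threshold chosen for this example
    count = len(submission.split())
    passed = count <= limit
    return GradeResult(
        score=1.0 if passed else 0.0,
        rationale=f"{count} words (limit {limit})",
    )
```

The function makes no network calls and depends only on its arguments, so the same inputs always produce the same score.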
Configuration
Built-in Functions
exact_match
Checks if submission exactly matches ground truth (case-sensitive, whitespace-trimmed).
Requires: ground_truth in dataset | Score: 1.0 if exact match, 0.0 otherwise
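The comparison can be sketched as follows. This is not the built-in's actual source: the function name exact_match_score is hypothetical, and trimming both the submission and the ground truth is an assumption.

```python
def exact_match_score(submission: str, ground_truth: str) -> float:
    """Sketch of a case-sensitive comparison after trimming whitespace.

    Assumption: both strings are trimmed before comparing.
    """
    return 1.0 if submission.strip() == ground_truth.strip() else 0.0
```

Under these assumptions, surrounding whitespace is ignored but a difference in letter case still fails.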
contains
Checks if submission contains ground truth (case-insensitive).
Requires: ground_truth in dataset | Score: 1.0 if found, 0.0 otherwise
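A case-insensitive substring check along these lines might look like the sketch below. The function name contains_score is hypothetical, and lowercasing both strings is one assumed way to ignore case.

```python
def contains_score(submission: str, ground_truth: str) -> float:
    """Sketch of a case-insensitive substring check.

    Assumption: case is ignored by lowercasing both strings.
    """
    return 1.0 if ground_truth.lower() in submission.lower() else 0.0
```

This style of check suits free-form answers where the expected value may be embedded in a longer sentence.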
regex_match
Checks if submission matches a regex pattern in ground truth.
Score: 1.0 if pattern matches, 0.0 otherwise
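The sketch below shows one way such a check could work, treating the ground truth as the pattern. The function name regex_match_score is hypothetical. The text above does not say whether the pattern must match the whole submission or only part of it, so this example assumes a partial match via re.search.

```python
import re


def regex_match_score(submission: str, pattern: str) -> float:
    """Sketch of a regex check where `pattern` comes from the ground truth.

    Assumption: a match anywhere in the submission counts (re.search).
    """
    return 1.0 if re.search(pattern, submission) else 0.0
```

If a full-string match is what you need, anchoring the pattern with ^ and $ makes the intent explicit under either interpretation.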
ascii_printable_only
Validates that all characters are printable ASCII.
Score: 1.0 if all characters are printable ASCII, 0.0 otherwise
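A check of this kind can be sketched as below. The function name ascii_printable_only_score is hypothetical, and the sketch assumes "printable ASCII" means code points 0x20 (space) through 0x7E (tilde). Whether the built-in also accepts newlines or tabs is not stated above.

```python
def ascii_printable_only_score(submission: str) -> float:
    """Sketch of a printable-ASCII validation.

    Assumption: only code points 0x20 through 0x7E count as printable,
    so newlines, tabs, and all non-ASCII characters fail.
    """
    ok = all(0x20 <= ord(ch) <= 0x7E for ch in submission)
    return 1.0 if ok else 0.0
```

Note that an empty submission scores 1.0 in this sketch, because there are no characters that violate the rule.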
Next Steps
- Rubric Graders - LLM-as-judge evaluation
- Custom Graders - Write your own grading functions
- Multi-Metric - Combine multiple graders