Tool Graders

Tool graders use Python functions to programmatically evaluate submissions. They’re ideal for deterministic, rule-based evaluation.

Overview

Tool graders:

  • Execute Python functions that take (sample, submission) and return a GradeResult
  • Are fast and deterministic
  • Don’t require external API calls
  • Can implement any custom logic
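As a rough illustration of the (sample, submission) shape described above, the sketch below defines a toy grader. The `GradeResult` dataclass here is a hypothetical stand-in with assumed `score` and `rationale` fields, and `starts_with_greeting` is an invented example name; the framework's real result type and its fields may differ.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class GradeResult:
    """Hypothetical stand-in for the framework's result type (fields assumed)."""
    score: float
    rationale: str = ""


def starts_with_greeting(sample: Any, submission: str) -> GradeResult:
    """Toy rule: full credit when the submission opens with 'hello'."""
    # `sample` is unused in this toy rule; a real grader would typically
    # read ground truth or other fields from it.
    ok = submission.strip().lower().startswith("hello")
    return GradeResult(
        score=1.0 if ok else 0.0,
        rationale="starts with greeting" if ok else "no greeting found",
    )
```

Because the function depends only on its inputs, calling it twice with the same submission yields the same score, which is the deterministic behavior this kind of grader is meant to provide.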

Configuration

graders:
  my_metric:
    kind: tool
    function: exact_match        # Function name
    extractor: last_assistant    # What to extract from trajectory

Built-in Functions

exact_match

Checks if submission exactly matches ground truth (case-sensitive, whitespace-trimmed).

graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant

Requires: ground_truth in dataset | Score: 1.0 if exact match, 0.0 otherwise
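The comparison can be sketched in a few lines. This is not the built-in's actual source: `exact_match_score` is a hypothetical name, and the sketch assumes that "whitespace-trimmed" means both the ground truth and the submission are stripped before comparing.

```python
def exact_match_score(ground_truth: str, submission: str) -> float:
    """Return 1.0 on a case-sensitive match after trimming, else 0.0."""
    # Assumption: leading/trailing whitespace is stripped on both sides.
    return 1.0 if submission.strip() == ground_truth.strip() else 0.0
```

Under this reading, `" Paris\n"` matches `"Paris"`, while `"paris"` does not because case is preserved.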

contains

Checks if submission contains ground truth (case-insensitive).

graders:
  contains_answer:
    kind: tool
    function: contains
    extractor: last_assistant

Requires: ground_truth in dataset | Score: 1.0 if found, 0.0 otherwise
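A minimal sketch of a case-insensitive containment check follows. The name `contains_score` is hypothetical, and lowercasing both strings is an assumption about how case-insensitivity is applied; the built-in may normalize differently.

```python
def contains_score(ground_truth: str, submission: str) -> float:
    """Return 1.0 if ground_truth occurs anywhere in submission, ignoring case."""
    # Assumption: case is normalized with str.lower() on both strings.
    return 1.0 if ground_truth.lower() in submission.lower() else 0.0
```

This makes `contains` more forgiving than `exact_match` when the answer is embedded in a longer sentence.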

regex_match

Checks if submission matches the regex pattern stored in ground truth.

graders:
  pattern:
    kind: tool
    function: regex_match
    extractor: last_assistant

Score: 1.0 if pattern matches, 0.0 otherwise
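The sketch below shows one plausible reading of this check using Python's standard `re` module. Several details are assumptions, not documented behavior: the hypothetical `regex_match_score` uses `re.search` (a match anywhere in the text) where the built-in might anchor at the start or require a full match, and it scores an invalid pattern as 0.0.

```python
import re


def regex_match_score(pattern: str, submission: str) -> float:
    """Return 1.0 if the pattern is found anywhere in submission, else 0.0."""
    try:
        # Assumption: search semantics (unanchored), not re.match or re.fullmatch.
        found = re.search(pattern, submission) is not None
    except re.error:
        # Assumption: a malformed pattern is treated as a failed match.
        return 0.0
    return 1.0 if found else 0.0
```

If stricter matching is needed under search semantics, the pattern itself can carry the anchors, for example `^\d{4}$`.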

ascii_printable_only

Validates that all characters are printable ASCII.

graders:
  ascii_check:
    kind: tool
    function: ascii_printable_only
    extractor: last_assistant

Score: 1.0 if all characters are printable ASCII, 0.0 otherwise
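One way to express this validation is sketched below. The hypothetical `ascii_printable_score` assumes "printable ASCII" means code points 0x20 (space) through 0x7E (tilde), so newlines and tabs would fail the check; the built-in may treat whitespace control characters differently.

```python
def ascii_printable_score(submission: str) -> float:
    """Return 1.0 if every character is in the range 0x20 to 0x7E, else 0.0."""
    # Assumption: newline, tab, and other control characters are not printable.
    # An empty string passes because there are no characters to reject.
    ok = all(0x20 <= ord(ch) <= 0x7E for ch in submission)
    return 1.0 if ok else 0.0
```

A check like this is useful for catching accented letters, emoji, or stray control characters in output that must stay within plain ASCII.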

Next Steps