Tool Graders

Grade agent tool usage, arguments, and execution patterns to ensure correct function calling.

Tool graders use Python functions to programmatically evaluate submissions. They’re ideal for deterministic, rule-based evaluation.

Overview

Tool graders:

Execute Python functions that take (sample, submission) and return a GradeResult
Are fast and deterministic
Don’t require external API calls
Can implement any custom logic

graders:
  my_metric:
    kind: tool
    function: exact_match # Function name
    extractor: last_assistant # What to extract from trajectory
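
As a rough illustration of the (sample, submission) to GradeResult shape described above, here is a minimal sketch of a custom, rule-based grader. The `Sample` and `GradeResult` classes are hypothetical stand-ins defined only so the example runs on its own; the framework's real types, their fields, and how a custom function is registered are not shown in this section and may differ. The `numeric_close` function and its `tolerance` parameter are likewise invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical stand-in for a dataset row; real samples may carry more fields.
    input: str
    ground_truth: str


@dataclass
class GradeResult:
    # Hypothetical stand-in for the framework's result type.
    score: float
    rationale: str = ""


def numeric_close(sample: Sample, submission: str, tolerance: float = 1e-6) -> GradeResult:
    """Score 1.0 when the submission parses to a number near the ground truth."""
    try:
        expected = float(sample.ground_truth)
        actual = float(submission.strip())
    except ValueError:
        # Either side failed to parse as a number, so the rule cannot pass.
        return GradeResult(score=0.0, rationale="submission or ground truth is not numeric")
    matched = abs(actual - expected) <= tolerance
    return GradeResult(
        score=1.0 if matched else 0.0,
        rationale=f"expected {expected}, got {actual}",
    )
```

The function is pure and makes no network calls, which is what keeps this style of grader fast and deterministic.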

The exact_match function checks whether the submission exactly matches the ground truth (case-sensitive, after trimming whitespace).

graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant

Requires: ground_truth in dataset | Score: 1.0 if exact match, 0.0 otherwise
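
The comparison described for exact_match can be sketched as follows. This is an approximation of the documented behavior, not the library's source: the helper name `exact_match_score` is hypothetical, and it assumes leading and trailing whitespace is trimmed on both the submission and the ground truth.

```python
def exact_match_score(submission: str, ground_truth: str) -> float:
    """Return 1.0 when the trimmed strings are identical, else 0.0."""
    # Assumption: both sides are stripped; case is left untouched.
    return 1.0 if submission.strip() == ground_truth.strip() else 0.0
```

Under these assumptions, `"  Paris\n"` matches `"Paris"`, while `"paris"` does not because the comparison is case-sensitive.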

The contains function checks whether the submission contains the ground truth (case-insensitive).

graders:
  contains_answer:
    kind: tool
    function: contains
    extractor: last_assistant

Requires: ground_truth in dataset | Score: 1.0 if found, 0.0 otherwise
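
A minimal sketch of the case-insensitive substring check described for contains. The helper name `contains_score` is hypothetical, and lowercasing both strings is one assumed way to ignore case; the actual implementation may normalize differently.

```python
def contains_score(submission: str, ground_truth: str) -> float:
    """Return 1.0 when ground_truth appears anywhere in submission, ignoring case."""
    # Assumption: case is ignored by lowercasing both sides before the check.
    return 1.0 if ground_truth.lower() in submission.lower() else 0.0
```

This is more forgiving than exact_match: a verbose answer such as "The capital is PARIS." still scores 1.0 against a ground truth of "Paris".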

The regex_match function checks whether the submission matches a regex pattern stored in the ground truth.

graders:
  pattern:
    kind: tool
    function: regex_match
    extractor: last_assistant

Score: 1.0 if pattern matches, 0.0 otherwise
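
The pattern check described for regex_match might look like the sketch below, where the ground truth holds the pattern. The helper name `regex_match_score` is hypothetical. Whether the real grader matches anywhere in the text or only from the start is not stated in this section; the sketch assumes a search anywhere in the submission, using Python's standard `re.search`.

```python
import re


def regex_match_score(submission: str, ground_truth_pattern: str) -> float:
    """Return 1.0 when the pattern is found in the submission, else 0.0."""
    # Assumption: re.search (match anywhere) rather than re.match or re.fullmatch.
    return 1.0 if re.search(ground_truth_pattern, submission) else 0.0
```

If the whole submission must conform to the pattern, anchoring the pattern with `^` and `$` gives that stricter behavior under the same sketch.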

The ascii_printable_only function validates that every character in the submission is printable ASCII.

graders:
  ascii_check:
    kind: tool
    function: ascii_printable_only
    extractor: last_assistant

Score: 1.0 if all characters are printable ASCII, 0.0 otherwise
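
A minimal sketch of the printable-ASCII check. The helper name `ascii_printable_score` is hypothetical, and the sketch assumes "printable ASCII" means code points 0x20 (space) through 0x7E (tilde). Whether the real grader also accepts newlines or tabs is not stated in this section, so this version treats them as failures.

```python
def ascii_printable_score(submission: str) -> float:
    """Return 1.0 when every character is in the range 0x20..0x7E, else 0.0."""
    # Assumption: control characters such as "\n" and "\t" count as non-printable.
    return 1.0 if all(0x20 <= ord(ch) <= 0x7E for ch in submission) else 0.0
```

Under this strict reading, accented letters, emoji, and line breaks all produce a score of 0.0.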