Tool Graders

Grade agent tool usage, arguments, and execution patterns to ensure correct function calling.

Tool graders use Python functions to programmatically evaluate submissions. They’re ideal for deterministic, rule-based evaluation.

Tool graders:

  • Execute Python functions that take (sample, submission) and return a GradeResult
  • Are fast and deterministic
  • Don’t require external API calls
  • Can implement any custom logic

graders:
  my_metric:
    kind: tool
    function: exact_match        # Function name
    extractor: last_assistant    # What to extract from trajectory
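
To illustrate the (sample, submission) -> GradeResult shape described above, here is a minimal sketch of a custom grading function. The GradeResult dataclass below is a hypothetical stand-in with assumed fields (score, rationale), the sample is assumed to be a plain dict, and word_limit and max_words are made-up names; check the framework's own types and registration mechanism before relying on any of this.

```python
from dataclasses import dataclass


@dataclass
class GradeResult:
    """Hypothetical stand-in for the framework's GradeResult (fields are assumed)."""
    score: float
    rationale: str = ""


def word_limit(sample: dict, submission: str) -> GradeResult:
    """Hypothetical custom grader: pass if the submission stays within a word budget.

    Assumes the sample is a dict that may carry a `max_words` field.
    """
    limit = sample.get("max_words", 50)
    count = len(submission.split())
    return GradeResult(
        score=1.0 if count <= limit else 0.0,
        rationale=f"{count} words (limit {limit})",
    )
```

Because the function is pure Python with no network calls, it returns the same score for the same input every time, which is the property that makes tool graders fast and deterministic.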

exact_match checks whether the submission exactly matches the ground truth (case-sensitive, whitespace-trimmed).

graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant

Requires: ground_truth in dataset | Score: 1.0 if exact match, 0.0 otherwise
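
The comparison can be sketched in a few lines of Python. This is an approximation of the documented behavior, not the built-in's actual source; in particular, trimming both sides (rather than only the submission) is an assumption, and exact_match_score is a hypothetical name.

```python
def exact_match_score(ground_truth: str, submission: str) -> float:
    """Sketch of a case-sensitive, whitespace-trimmed equality check.

    Assumption: both strings are stripped before comparison.
    """
    return 1.0 if submission.strip() == ground_truth.strip() else 0.0
```

Under this sketch, "  Paris\n" scores 1.0 against "Paris", while "paris" scores 0.0 because case is preserved.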

contains checks whether the submission contains the ground truth as a substring (case-insensitive).

graders:
  contains_answer:
    kind: tool
    function: contains
    extractor: last_assistant

Requires: ground_truth in dataset | Score: 1.0 if found, 0.0 otherwise
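
A rough Python equivalent of this check is shown below. It is a sketch under assumptions: contains_score is a hypothetical name, and lowercasing both strings is one plausible way to get case-insensitivity; the built-in may normalize differently.

```python
def contains_score(ground_truth: str, submission: str) -> float:
    """Sketch of a case-insensitive substring check.

    Assumption: case is ignored by lowercasing both strings.
    """
    return 1.0 if ground_truth.lower() in submission.lower() else 0.0
```

This is more forgiving than an exact match: a submission such as "The capital is PARIS." would score 1.0 against the ground truth "Paris".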

regex_match checks whether the submission matches the regex pattern stored in the ground truth.

graders:
  pattern:
    kind: tool
    function: regex_match
    extractor: last_assistant

Score: 1.0 if pattern matches, 0.0 otherwise
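
The following sketch shows one way such a check could work, treating the ground truth as the pattern. Whether the built-in matches anywhere in the submission (as re.search does here) or requires a full match is not stated above, so that choice is an assumption, as is the name regex_match_score.

```python
import re


def regex_match_score(pattern: str, submission: str) -> float:
    """Sketch of a regex grader where the ground truth holds the pattern.

    Assumption: a match anywhere in the submission counts (re.search).
    """
    return 1.0 if re.search(pattern, submission) else 0.0
```

With the pattern `\d{4}-\d{2}-\d{2}`, a submission like "Due on 2024-01-31." would pass under this sketch, while "Due next week." would not.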

ascii_printable_only validates that every character in the submission is printable ASCII.

graders:
  ascii_check:
    kind: tool
    function: ascii_printable_only
    extractor: last_assistant

Score: 1.0 if all characters are printable ASCII, 0.0 otherwise
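
As a sketch of what this validation might look like, the function below accepts only code points 32 through 126 (space through tilde). How the built-in treats newlines and tabs is not stated above, so rejecting them here is an assumption, and ascii_printable_score is a hypothetical name.

```python
def ascii_printable_score(submission: str) -> float:
    """Sketch of a printable-ASCII check.

    Assumption: "printable" means code points 32..126 inclusive,
    so newlines, tabs, and all non-ASCII characters fail.
    """
    return 1.0 if all(32 <= ord(ch) <= 126 for ch in submission) else 0.0
```

Under this strict reading, "Hello, world!" passes, while "café" and "line one\nline two" both fail.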