Graders
Graders are the scoring functions that evaluate agent responses. They take the extracted submission (from an extractor) and assign a score between 0.0 (complete failure) and 1.0 (perfect success).
Quick overview:
- Two types: Tool graders (deterministic Python functions) and Rubric graders (LLM-as-judge)
- Built-in functions: `exact_match`, `contains`, `regex_match`, `ascii_printable_only`
- Custom graders: Write your own grading logic
- Multi-metric: Combine multiple graders in one suite
- Flexible extraction: Each grader can use a different extractor
When to use each:
- Tool graders: Fast, deterministic, free - perfect for exact matching, patterns, tool validation
- Rubric graders: Flexible, subjective, costs API calls - ideal for quality, creativity, nuanced evaluation
Grader Types
There are two types of graders:
Tool Graders
Python functions that programmatically compare the submission to ground truth or apply deterministic checks.
Best for:
- Exact matching
- Pattern checking
- Tool call validation
- Deterministic criteria
Rubric Graders
LLM-as-judge evaluation using custom prompts and criteria. The judge can be either a direct LLM API call (standard rubric grading) or a Letta agent (agent-as-judge).
Best for:
- Subjective quality assessment
- Open-ended responses
- Nuanced evaluation
- Complex criteria
- Judges that need tools (when using agent-as-judge)
Built-in Tool Graders
exact_match
Checks if submission exactly matches ground truth (case-sensitive, whitespace-trimmed).
Requires: `ground_truth` in dataset
Score: 1.0 if exact match, 0.0 otherwise
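A minimal sketch of the comparison logic (illustrative, not the library's actual implementation):

```python
def exact_match(submission: str, ground_truth: str) -> float:
    # Illustrative sketch: case-sensitive comparison after trimming
    # surrounding whitespace, per the behavior described above.
    return 1.0 if submission.strip() == ground_truth.strip() else 0.0
```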
contains
Checks if submission contains ground truth (case-insensitive).
Requires: `ground_truth` in dataset
Score: 1.0 if found, 0.0 otherwise
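Conceptually (an illustrative sketch):

```python
def contains(submission: str, ground_truth: str) -> float:
    # Illustrative sketch: case-insensitive substring check.
    return 1.0 if ground_truth.lower() in submission.lower() else 0.0
```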
regex_match
Checks if the submission matches the regex pattern stored in `ground_truth`.
Dataset sample (hypothetical; field names are illustrative):
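```python
# Hypothetical dataset row: ground_truth holds the regex pattern to apply.
sample = {
    "input": "What is the ticket ID for my last order?",
    "ground_truth": r"TICKET-\d{4}",
}
```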
Score: 1.0 if pattern matches, 0.0 otherwise
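The check itself might look like this (illustrative):

```python
import re

def regex_match(submission: str, ground_truth: str) -> float:
    # Illustrative sketch: any occurrence of the pattern anywhere
    # in the submission counts as a match.
    return 1.0 if re.search(ground_truth, submission) else 0.0
```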
ascii_printable_only
Validates that all characters are printable ASCII (useful for ASCII art, formatted output).
Does not require ground truth.
Score: 1.0 if all characters are printable ASCII, 0.0 if any non-printable characters found
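A sketch of the check (illustrative; whether whitespace characters count as printable is an assumption, made here because ASCII art spans multiple lines):

```python
def ascii_printable_only(submission: str) -> float:
    # Illustrative sketch: printable ASCII is code points 32-126;
    # newlines, tabs, and carriage returns are assumed to be allowed.
    allowed = {chr(c) for c in range(32, 127)} | {"\n", "\t", "\r"}
    return 1.0 if all(ch in allowed for ch in submission) else 0.0
```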
Rubric Graders
Rubric graders use an LLM to evaluate responses based on custom criteria.
Basic Configuration
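A hypothetical rubric grader configuration, sketched as a Python dict (the key names are assumptions, not the actual suite schema):

```python
# Hypothetical config; key names are assumptions, not the actual schema.
grader = {
    "type": "rubric",
    "prompt_file": "quality_rubric.txt",
    "model": "gpt-4o-mini",
    "extractor": "last_assistant_message",
}
```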
Rubric Prompt Format
Your rubric file should describe the evaluation criteria. Use placeholders:
- `{input}`: The original input from the dataset
- `{submission}`: The extracted agent response
- `{ground_truth}`: Ground truth from dataset (if available)
Example `quality_rubric.txt`:
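One possible rubric (an illustrative sketch):

```text
Evaluate the agent's response for overall quality.

Input: {input}
Response: {submission}

Criteria:
- Directly addresses the user's request
- Accurate and internally consistent
- Clear and well organized

Return a score from 0.0 (fails all criteria) to 1.0 (meets all criteria).
```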
Inline Prompt
Instead of a file, you can include the prompt inline:
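A hypothetical sketch (the `prompt` key and overall schema are assumptions):

```python
# Hypothetical config; key names are assumptions, not the actual schema.
grader = {
    "type": "rubric",
    "prompt": (
        "Rate the response for politeness from 0.0 to 1.0.\n"
        "Response: {submission}"
    ),
    "extractor": "last_assistant_message",
}
```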
Model Configuration
Supported providers: `openai` (default)
Models:
- Any OpenAI-compatible model
- Special handling for reasoning models (o1, o3) - temperature automatically adjusted to 1.0
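A hypothetical model block (key names are assumptions):

```python
# Hypothetical model settings; key names are assumptions.
model_config = {
    "provider": "openai",    # default provider
    "model": "gpt-4o-mini",  # any OpenAI-compatible model works
    "temperature": 0.0,      # automatically adjusted to 1.0 for o1/o3
}
```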
Structured Output
Rubric graders use JSON mode to get structured responses from the judge.
The score is validated to be between 0.0 and 1.0.
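A sketch of parsing and validating a judge response (the JSON field names are assumptions):

```python
import json

# Hypothetical judge output in JSON mode; field names are assumptions.
raw = '{"score": 0.85, "rationale": "Accurate but slightly verbose."}'
result = json.loads(raw)

score = float(result["score"])
if not 0.0 <= score <= 1.0:
    raise ValueError(f"judge returned an out-of-range score: {score}")
```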
Multi-Metric Configuration
Evaluate multiple aspects in one suite:
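A hypothetical two-grader suite, sketched as a Python dict (key names are assumptions, not the actual schema):

```python
# Hypothetical multi-metric suite; key names are assumptions.
graders = {
    "correctness": {
        "type": "tool",
        "function": "exact_match",
        "extractor": "last_assistant_message",
    },
    "politeness": {
        "type": "rubric",
        "prompt_file": "politeness_rubric.txt",
        "extractor": "full_transcript",
    },
}
```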
Each grader can use a different extractor.
Extractor Configuration
Every grader must specify an `extractor` to select what to grade.
Some extractors need additional configuration:
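A hypothetical example (the extractor name and config keys are assumptions):

```python
# Hypothetical grader whose extractor takes extra configuration.
grader = {
    "type": "tool",
    "function": "exact_match",
    "extractor": "tool_call_arguments",  # illustrative extractor name
    "extractor_config": {"tool_name": "send_email"},
}
```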
See Extractors for all available extractors.
Custom Graders
You can write custom grading functions. See Custom Graders for details.
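As a rough sketch, a custom grader is just a function that maps a submission to a score in [0.0, 1.0] (the exact signature is an assumption; see Custom Graders for the actual interface):

```python
def within_length_budget(submission: str, ground_truth: str | None = None) -> float:
    # Hypothetical custom grader: full credit for concise answers,
    # partial credit otherwise. The signature is an assumption.
    return 1.0 if len(submission) <= 280 else 0.5
```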
Grader Selection Guide

| If you need... | Use |
| --- | --- |
| Exact matching, pattern checks, tool call validation | Tool grader (fast, deterministic, free) |
| Quality, creativity, or other nuanced judgment | Rubric grader (flexible, costs API calls) |
| A judge that can use tools | Rubric grader with agent-as-judge |
Score Interpretation
All scores are between 0.0 and 1.0:
- 1.0: Perfect - meets all criteria
- 0.75-0.99: Good - minor issues
- 0.5-0.74: Acceptable - notable gaps
- 0.25-0.49: Poor - major problems
- 0.0-0.24: Failed - did not meet criteria
Tool graders typically return binary scores (0.0 or 1.0), while rubric graders can return any value in the range.
Error Handling
If grading fails (e.g., network error, invalid format):
- Score is set to 0.0
- Rationale includes error message
- Metadata includes error details
This ensures evaluations can continue even with individual failures.
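Conceptually, the failure path looks like this (an illustrative sketch, not the actual implementation):

```python
def safe_grade(grade_fn, submission: str) -> dict:
    # Illustrative sketch: a grading failure yields score 0.0 with the
    # error captured, so the rest of the suite keeps running.
    try:
        return {"score": grade_fn(submission), "rationale": "graded", "metadata": {}}
    except Exception as exc:
        return {
            "score": 0.0,
            "rationale": f"Grading failed: {exc}",
            "metadata": {"error": repr(exc)},
        }
```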
Next Steps
- Tool Graders - Built-in and custom functions
- Rubric Graders - LLM-as-judge details
- Multi-Metric Evaluation - Using multiple graders
- Extractors - Selecting what to grade