Graders

Graders are the scoring functions that evaluate agent responses. They take the extracted submission (from an extractor) and assign a score between 0.0 (complete failure) and 1.0 (perfect success).

Quick overview:

  • Two types: Tool graders (deterministic Python functions) and Rubric graders (LLM-as-judge)
  • Built-in functions: exact_match, contains, regex_match, ascii_printable_only
  • Custom graders: Write your own grading logic
  • Multi-metric: Combine multiple graders in one suite
  • Flexible extraction: Each grader can use a different extractor

When to use each:

  • Tool graders: Fast, deterministic, and free of API costs - ideal for exact matching, pattern checks, and tool call validation
  • Rubric graders: Flexible and able to judge subjective criteria, but cost API calls - ideal for quality, creativity, and nuanced evaluation

Grader Types

There are two types of graders:

Tool Graders

Python functions that programmatically compare the submission to ground truth or apply deterministic checks.

graders:
  accuracy:
    kind: tool                    # Deterministic grading
    function: exact_match         # Built-in grading function
    extractor: last_assistant     # Use final agent response

Best for:

  • Exact matching
  • Pattern checking
  • Tool call validation
  • Deterministic criteria

Rubric Graders

LLM-as-judge evaluation using custom prompts and criteria. Can use either direct LLM API calls or a Letta agent as the judge.

Standard rubric grading (LLM API):

graders:
  quality:
    kind: rubric                  # LLM-as-judge
    prompt_path: rubric.txt       # Custom evaluation criteria
    model: gpt-4o-mini            # Judge model
    extractor: last_assistant     # What to evaluate

Agent-as-judge (Letta agent):

graders:
  agent_judge:
    kind: rubric                  # Still "rubric" kind
    agent_file: judge.af          # Judge agent with submit_grade tool
    prompt_path: rubric.txt       # Evaluation criteria
    extractor: last_assistant     # What to evaluate

Best for:

  • Subjective quality assessment
  • Open-ended responses
  • Nuanced evaluation
  • Complex criteria
  • Judges that need tools (when using agent-as-judge)

Built-in Tool Graders

exact_match

Checks if submission exactly matches ground truth (case-sensitive, whitespace-trimmed).

graders:
  accuracy:
    kind: tool
    function: exact_match         # Case-sensitive, whitespace-trimmed
    extractor: last_assistant     # Extract final response

Requires: ground_truth in dataset

Score: 1.0 if exact match, 0.0 otherwise
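
For example, a matching dataset sample might look like the following (a hypothetical line in the same JSONL format as the regex_match sample below):

{"input": "What is the capital of France?", "ground_truth": "Paris"}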

contains

Checks if submission contains ground truth (case-insensitive).

graders:
  contains_answer:
    kind: tool
    function: contains            # Case-insensitive substring match
    extractor: last_assistant     # Search in final response

Requires: ground_truth in dataset

Score: 1.0 if found, 0.0 otherwise
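
For example, a dataset sample for this grader might look like the following (hypothetical; the grader checks case-insensitively whether the ground truth string appears anywhere in the submission):

{"input": "Which language runs natively in web browsers?", "ground_truth": "javascript"}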

regex_match

Checks if submission matches a regex pattern in ground truth.

graders:
  pattern:
    kind: tool
    function: regex_match         # Pattern matching
    extractor: last_assistant     # Check final response

Dataset sample:

1{"input": "Generate a UUID", "ground_truth": "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"}

Score: 1.0 if pattern matches, 0.0 otherwise

ascii_printable_only

Validates that all characters are printable ASCII (useful for ASCII art, formatted output).

graders:
  ascii_check:
    kind: tool
    function: ascii_printable_only   # Validate ASCII characters
    extractor: last_assistant        # Check final response

Does not require ground truth.

Score: 1.0 if all characters are printable ASCII, 0.0 if any non-printable characters found
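
A dataset sample for this grader only needs an input (hypothetical example; no ground_truth field is required):

{"input": "Draw a smiley face using only printable ASCII characters"}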

Rubric Graders

Rubric graders use an LLM to evaluate responses based on custom criteria.

Basic Configuration

graders:
  quality:
    kind: rubric                       # LLM-as-judge
    prompt_path: quality_rubric.txt    # Evaluation criteria
    model: gpt-4o-mini                 # Judge model
    temperature: 0.0                   # Deterministic
    extractor: last_assistant          # What to evaluate

Rubric Prompt Format

Your rubric file should describe the evaluation criteria. Use placeholders:

  • {input}: The original input from the dataset
  • {submission}: The extracted agent response
  • {ground_truth}: Ground truth from dataset (if available)

Example quality_rubric.txt:

Evaluate the response for:
1. Accuracy: Does it correctly answer the question?
2. Completeness: Is the answer thorough?
3. Clarity: Is it well-explained?

Input: {input}
Expected: {ground_truth}
Response: {submission}

Score from 0.0 to 1.0 where:
- 1.0: Perfect response
- 0.75: Good with minor issues
- 0.5: Acceptable but incomplete
- 0.25: Poor quality
- 0.0: Completely wrong

Inline Prompt

Instead of a file, you can include the prompt inline:

graders:
  quality:
    kind: rubric                  # LLM-as-judge
    prompt: |                     # Inline prompt instead of file
      Evaluate the creativity and originality of the response.
      Score 1.0 for highly creative, 0.0 for generic or unoriginal.
    model: gpt-4o-mini            # Judge model
    extractor: last_assistant     # What to evaluate

Model Configuration

graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt       # Evaluation criteria
    model: gpt-4o-mini            # Judge model
    temperature: 0.0              # Deterministic (0.0-2.0)
    provider: openai              # LLM provider (default: openai)
    max_retries: 5                # API retry attempts
    timeout: 120.0                # Request timeout in seconds

Supported providers:

  • openai (default)

Models:

  • Any OpenAI-compatible model
  • Special handling for reasoning models (o1, o3): temperature is automatically adjusted to 1.0 (see the sketch below)
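
For example, a judge configured with a reasoning model might look like the sketch below (same keys as the configuration above; o1 is used here only as a placeholder judge model, and any explicit temperature is adjusted to 1.0 automatically):

graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt       # Evaluation criteria
    model: o1                     # Reasoning model as judge
    temperature: 0.0              # Automatically adjusted to 1.0 for reasoning models
    extractor: last_assistant     # What to evaluate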

Structured Output

Rubric graders use JSON mode to get structured responses:

{
  "score": 0.85,
  "rationale": "The response is accurate and complete but could be more concise."
}

The score is validated to be between 0.0 and 1.0.

Multi-Metric Configuration

Evaluate multiple aspects in one suite:

graders:
  accuracy:                       # Tool grader for factual correctness
    kind: tool
    function: contains
    extractor: last_assistant

  completeness:                   # Rubric grader for thoroughness
    kind: rubric
    prompt_path: completeness_rubric.txt
    model: gpt-4o-mini
    extractor: last_assistant

  tool_usage:                     # Tool grader for tool call validation
    kind: tool
    function: exact_match
    extractor: tool_arguments     # Extract tool call args
    extractor_config:
      tool_name: search           # Which tool to check

Each grader can use a different extractor.

Extractor Configuration

Every grader must specify an extractor to select what to grade:

graders:
  my_metric:
    kind: tool
    function: contains            # Grading function
    extractor: last_assistant     # What to extract and grade

Some extractors need additional configuration:

graders:
  tool_check:
    kind: tool
    function: contains            # Check if ground truth in tool args
    extractor: tool_arguments     # Extract tool call arguments
    extractor_config:             # Configuration for this extractor
      tool_name: search           # Which tool to extract from

See Extractors for all available extractors.

Custom Graders

You can write custom grading functions. See Custom Graders for details.

Grader Selection Guide

  • Exact answer matching: exact_match
  • Keyword checking: contains
  • Pattern validation: regex_match
  • Tool call validation: exact_match with the tool_arguments extractor
  • Quality assessment: rubric grader
  • Creativity evaluation: rubric grader
  • Format checking: custom tool grader
  • Multi-criteria evaluation: multiple graders

Score Interpretation

All scores are between 0.0 and 1.0:

  • 1.0: Perfect - meets all criteria
  • 0.75-0.99: Good - minor issues
  • 0.5-0.74: Acceptable - notable gaps
  • 0.25-0.49: Poor - major problems
  • 0.0-0.24: Failed - did not meet criteria

Tool graders typically return binary scores (0.0 or 1.0), while rubric graders can return any value in the range.

Error Handling

If grading fails (e.g., network error, invalid format):

  • Score is set to 0.0
  • Rationale includes error message
  • Metadata includes error details

This ensures evaluations can continue even with individual failures.
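
For illustration, a failed grade might be recorded with the same score/rationale shape shown under Structured Output (hypothetical values; the metadata would carry the full error details):

{
  "score": 0.0,
  "rationale": "Grading failed: request to judge model timed out"
}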

Next Steps