# Rubric Graders

Rubric graders, also called "LLM-as-judge" graders, use language models to evaluate submissions based on custom criteria. They're ideal for subjective, nuanced evaluation.
Rubric graders work by giving the LLM a prompt that describes the evaluation criteria; the model then generates a structured JSON response with a score and rationale:
{ "score": 0.85, "rationale": "The response is accurate and well-explained, but could be more concise."}Schema requirements:
score(required): Decimal number between 0.0 and 1.0rationale(required): String explanation of the grading decision
Note: OpenAI provides the best support for structured generation. Other providers may adhere to the output schema less reliably.
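To make the schema concrete, here is one way it could be expressed as a Pydantic model. This is a hypothetical sketch for validating judge output, not the framework's internal class:

```python
from pydantic import BaseModel, Field

class GradeResponse(BaseModel):
    """Hypothetical model mirroring the required judge output schema."""
    score: float = Field(..., ge=0.0, le=1.0)  # decimal in [0.0, 1.0]
    rationale: str                             # explanation of the grading decision

# Parse and validate a raw judge reply
grade = GradeResponse.model_validate_json(
    '{"score": 0.85, "rationale": "Accurate but could be more concise."}'
)
print(grade.score)  # 0.85
```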
## Overview

Rubric graders:
- Use an LLM to evaluate responses
- Support custom evaluation criteria (rubrics)
- Can handle subjective quality assessment
- Return scores with explanations
- Use JSON structured generation for reliability
## Basic Configuration

```yaml
graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt    # Path to rubric file
    model: gpt-4o-mini         # LLM model
    extractor: last_assistant
```

## Rubric Prompts
Your rubric file defines the evaluation criteria. It can include placeholders:
- `{input}`: The original input from the dataset
- `{submission}`: The extracted agent response
- `{ground_truth}`: Ground truth from the dataset (if available)
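As a minimal illustration, the substitution can be done with str.format-style replacement. The helper below is hypothetical, not the framework's internal code:

```python
RUBRIC = "Input: {input}\nExpected: {ground_truth}\nResponse: {submission}\nScore 0.0-1.0."

def build_judge_prompt(rubric: str, input_text: str, submission: str,
                       ground_truth: str = "") -> str:
    # Fill each placeholder; a missing ground truth becomes an empty string
    return rubric.format(input=input_text, submission=submission,
                         ground_truth=ground_truth)

print(build_judge_prompt(RUBRIC, "What is 2+2?", "4", "4"))
```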
### Example Rubric

quality_rubric.txt:
```text
Evaluate the response for accuracy, completeness, and clarity.

Input: {input}
Expected answer: {ground_truth}
Agent response: {submission}

Scoring criteria:
- 1.0: Perfect - accurate, complete, and clear
- 0.8-0.9: Excellent - minor improvements possible
- 0.6-0.7: Good - some gaps or unclear parts
- 0.4-0.5: Adequate - significant issues
- 0.2-0.3: Poor - major problems
- 0.0-0.1: Failed - incorrect or nonsensical

Provide a score between 0.0 and 1.0.
```

## Inline Prompts
You can include the prompt directly in the YAML:
```yaml
graders:
  creativity:
    kind: rubric
    prompt: |
      Evaluate the creativity and originality of the response.

      Response: {submission}

      Score from 0.0 (generic) to 1.0 (highly creative).
    model: gpt-4o-mini
    extractor: last_assistant
```

## Configuration Options
### prompt_path vs prompt
Use exactly one:
```yaml
# Option 1: External file
graders:
  quality:
    kind: rubric
    prompt_path: rubrics/quality.txt   # Relative to suite YAML
    model: gpt-4o-mini
    extractor: last_assistant

# Option 2: Inline
graders:
  quality:
    kind: rubric
    prompt: "Evaluate the response quality from 0.0 to 1.0"
    model: gpt-4o-mini
    extractor: last_assistant
```

### model

LLM model to use for judging:
```yaml
graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt
    model: gpt-4o-mini   # Or gpt-4o, claude-3-5-sonnet, etc.
    extractor: last_assistant
```

Supported: Any OpenAI-compatible model
Special handling: For reasoning models (o1, o3, gpt-5), temperature is automatically set to 1.0 even if you specify 0.0.
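A sketch of what such an override could look like. This is a hypothetical helper; the model-name prefixes are assumptions based on the models listed above:

```python
REASONING_MODEL_PREFIXES = ("o1", "o3", "gpt-5")  # assumed prefixes

def effective_temperature(model: str, requested: float) -> float:
    # Reasoning models only accept temperature 1.0, so override silently
    if model.startswith(REASONING_MODEL_PREFIXES):
        return 1.0
    return requested

assert effective_temperature("gpt-4o-mini", 0.0) == 0.0
assert effective_temperature("o3-mini", 0.0) == 1.0
```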
### temperature

Controls randomness in LLM generation:
```yaml
graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt
    model: gpt-4o-mini
    temperature: 0.0   # Deterministic (default)
    extractor: last_assistant
```

Range: 0.0 (deterministic) to 2.0 (very random)
Default: 0.0 (recommended for evaluations)
### provider

LLM provider:

```yaml
graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt
    model: gpt-4o-mini
    provider: openai   # Default
    extractor: last_assistant
```

Currently supported: `openai` (default)
### max_retries

Number of retry attempts if an API call fails:

```yaml
graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt
    model: gpt-4o-mini
    max_retries: 5   # Default
    extractor: last_assistant
```

### timeout
Timeout for API calls in seconds:
```yaml
graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt
    model: gpt-4o-mini
    timeout: 120.0   # Default: 2 minutes
    extractor: last_assistant
```

## How It Works
- Prompt Building: The rubric prompt is populated with the placeholders (`{input}`, `{submission}`, `{ground_truth}`)
- System Prompt: Instructs the LLM to return JSON with `score` and `rationale` fields
- Structured Output: Uses JSON mode (`response_format: json_object`) to enforce the schema
- Validation: Extracts and validates the score (clamped to 0.0-1.0) and rationale
- Error Handling: Returns score 0.0 with an error message if grading fails
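Putting these steps together, a stripped-down grading call might look like the following sketch using the openai Python client. Error handling is simplified; the real grader also applies retries and records error metadata:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade(rubric_prompt: str, system_prompt: str) -> dict:
    try:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0.0,
            response_format={"type": "json_object"},  # JSON mode
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": rubric_prompt},
            ],
        )
        data = json.loads(resp.choices[0].message.content)
        # Clamp the score into [0.0, 1.0] as described above
        score = min(max(float(data["score"]), 0.0), 1.0)
        return {"score": score, "rationale": data["rationale"]}
    except Exception as exc:
        # On any failure, fall back to 0.0 with the error as rationale
        return {"score": 0.0, "rationale": f"Grading failed: {exc}"}
```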
## System Prompt

The rubric grader automatically includes this system prompt:
```text
You are an evaluation judge. You will be given:
1. A rubric describing evaluation criteria
2. An input/question
3. A submission to evaluate

Evaluate the submission according to the rubric and return a JSON response with:
{
  "score": (REQUIRED: a decimal number between 0.0 and 1.0 inclusive),
  "rationale": "explanation of your grading decision"
}

IMPORTANT:
- The score MUST be a number between 0.0 and 1.0 (inclusive)
- 0.0 means complete failure, 1.0 means perfect
- Use decimal values for partial credit (e.g., 0.25, 0.5, 0.75)
- Be objective and follow the rubric strictly
```

If the LLM returns invalid JSON or missing fields, the grading fails and returns score 0.0 with an error message.
## Agent-as-Judge

Instead of calling an LLM API directly, you can use a Letta agent as the judge. The agent-as-judge approach loads a Letta agent from a `.af` file, sends it the evaluation criteria, and collects its score via a tool call.
### Why Use Agent-as-Judge?

Agent-as-judge is ideal when:
- No direct LLM API access: Your team uses Letta Cloud or managed instances without direct API keys
- Judges need tools: The evaluator needs to call tools during grading (e.g., web search, database queries, fetching webpages to verify answers)
- Centralized LLM access: Your organization provides LLM access only through Letta
- Custom evaluation logic: You want the judge to use specific tools or follow complex evaluation workflows
- Teacher-student patterns: You have a well-built, experienced agent that can evaluate and teach a student agent being developed
### Configuration

To use agent-as-judge, specify `agent_file` instead of `model`:

```yaml
graders:
  agent_judge:
    kind: rubric                     # Still "rubric" kind
    agent_file: judge.af             # Path to judge agent .af file
    prompt_path: rubric.txt          # Evaluation criteria
    judge_tool_name: submit_grade    # Tool for submitting scores (default: submit_grade)
    extractor: last_assistant        # What to extract from target agent
```

Key differences from standard rubric grading:
- Use `agent_file` instead of `model`
- No `temperature`, `provider`, `max_retries`, or `timeout` fields (the agent handles retries internally)
- The judge agent must have a `submit_grade(score: float, rationale: str)` tool
- The framework validates the judge tool on initialization (fail-fast)
### Judge Agent Requirements

Your judge agent must have a tool with this exact signature:
```python
def submit_grade(score: float, rationale: str) -> dict:
    """
    Submit an evaluation grade for an agent's response.

    Args:
        score: A float between 0.0 (complete failure) and 1.0 (perfect)
        rationale: Explanation of why this score was given

    Returns:
        dict: Confirmation of grade submission
    """
    return {
        "status": "success",
        "grade": {"score": score, "rationale": rationale}
    }
```

Validation on initialization: The framework validates that the judge agent has the correct tool with the right parameters before running evaluations. If validation fails, you'll get a clear error:
```text
ValueError: Judge tool 'submit_grade' not found in agent file judge.af.
Available tools: ['fetch_webpage', 'search_documents']
```

This fail-fast approach catches configuration errors immediately.
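For intuition, the fail-fast check might look roughly like this, assuming the agent file's tools are available as a list of `{name, parameters}` dicts. This structure is hypothetical; the actual `.af` parsing is internal to the framework:

```python
def validate_judge_tool(tools: list[dict], tool_name: str = "submit_grade") -> None:
    """Raise immediately if the judge agent lacks a usable grading tool."""
    by_name = {t["name"]: t for t in tools}
    if tool_name not in by_name:
        raise ValueError(
            f"Judge tool '{tool_name}' not found. "
            f"Available tools: {sorted(by_name)}"
        )
    missing = {"score", "rationale"} - set(by_name[tool_name].get("parameters", []))
    if missing:
        raise ValueError(f"Judge tool '{tool_name}' missing parameters: {missing}")

# Example: this raises, mirroring the error shown above
try:
    validate_judge_tool([{"name": "fetch_webpage", "parameters": ["url"]}])
except ValueError as exc:
    print(exc)
```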
### Checklist: Will Your Judge Agent Work?

- Tool exists: The agent has a tool with the name specified in `judge_tool_name` (default: `submit_grade`)
- Tool parameters: The tool has BOTH `score: float` and `rationale: str` parameters
- Tool is callable: The tool is not disabled and does not require approval
- Agent system prompt: The agent understands it's an evaluator (optional but recommended)
- No conflicting tools: The agent doesn't have other tools that might confuse it into answering questions instead of judging
### Example Configuration

suite.yaml:

```yaml
name: fetch-webpage-agent-judge-test
description: Test agent responses using a Letta agent as judge
dataset: dataset.csv

target:
  kind: agent
  agent_file: my_agent.af          # Agent being tested
  base_url: http://localhost:8283

graders:
  agent_judge:
    kind: rubric
    agent_file: judge.af           # Judge agent with submit_grade tool
    prompt_path: rubric.txt        # Evaluation criteria
    judge_tool_name: submit_grade  # Tool name (default: submit_grade)
    extractor: last_assistant      # Extract target agent's response

gate:
  metric_key: agent_judge
  op: gte
  value: 0.75                      # Pass if avg score ≥ 0.75
```

rubric.txt:
```text
Evaluate the agent's response based on the following criteria:

1. **Correctness (0.6 weight)**: Does the response contain accurate information from the webpage? Check if the answer matches what was requested in the input.

2. **Format (0.2 weight)**: Is the response formatted correctly? The input often requests answers in a specific format (e.g., in brackets like {Answer}).

3. **Completeness (0.2 weight)**: Does the response fully address the question without unnecessary information?

Scoring Guidelines:
- 1.0: Perfect response - correct, properly formatted, and complete
- 0.75-0.99: Good response - minor formatting or completeness issues
- 0.5-0.74: Adequate response - correct information but format/completeness problems
- 0.25-0.49: Poor response - partially correct or missing key information
- 0.0-0.24: Failed response - incorrect or no relevant information

Use the submit_grade tool to submit your evaluation with a score between 0.0 and 1.0. You will need to use your fetch_webpage tool to fetch the desired webpage and confirm the answer is correct.
```

Judge agent with tools: The judge agent in this example has a `fetch_webpage` tool, allowing it to independently verify answers by fetching the webpage mentioned in the input.
### How Agent-as-Judge Works

- Agent Loading: Loads the judge agent from the `.af` file and validates the tool signature
- Prompt Formatting: Formats the rubric with the `{input}`, `{submission}`, `{ground_truth}` placeholders
- Agent Evaluation: Sends the formatted prompt to the judge agent as a message
- Tool Call Parsing: Extracts score and rationale from the `submit_grade` tool call
- Cleanup: Deletes the judge agent after evaluation to free resources
- Error Handling: Returns score 0.0 with an error message if the judge fails to call the tool
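For illustration, extracting the grade from the judge's messages might look like this sketch over generic message dicts. The real message objects come from the Letta client; the field names here are assumptions:

```python
import json

def extract_grade(messages: list[dict], tool_name: str = "submit_grade") -> dict:
    # Scan the judge agent's messages for the grading tool call
    for msg in messages:
        for call in msg.get("tool_calls", []):
            if call["name"] == tool_name:
                args = json.loads(call["arguments"])
                score = min(max(float(args["score"]), 0.0), 1.0)
                return {"score": score, "rationale": args["rationale"]}
    # Judge never called the tool: fail with score 0.0 as described above
    return {"score": 0.0, "rationale": f"Judge did not call {tool_name}"}

demo = [{"tool_calls": [{"name": "submit_grade",
                         "arguments": '{"score": 0.9, "rationale": "Verified."}'}]}]
print(extract_grade(demo))  # {'score': 0.9, 'rationale': 'Verified.'}
```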
### Agent-as-Judge vs Standard Rubric Grading

| Feature | Standard Rubric | Agent-as-Judge |
|---|---|---|
| LLM Access | Direct API (OPENAI_API_KEY) | Through Letta agent |
| Tools | No tool usage | Judge can use tools |
| Configuration | model, temperature, etc. | agent_file, judge_tool_name |
| Output Format | JSON structured output | Tool call with score/rationale |
| Validation | Runtime JSON parsing | Upfront tool signature validation |
| Use Case | Teams with API access | Teams using Letta Cloud, judges needing tools |
| Cost | API call per sample | Depends on judge agent’s LLM config |
### Teacher-Student Pattern

A powerful use case for agent-as-judge is the teacher-student pattern, where an experienced, well-configured agent evaluates a student agent being developed.
Prerequisites: This pattern assumes you already have a well-defined, production-ready agent that performs well on your task. This agent becomes the “teacher” that evaluates the “student” agent you’re developing.
Why this works:
- Domain expertise: The teacher agent has specialized knowledge and tools
- Consistent evaluation: The teacher applies the same standards across all evaluations
- Tool-based verification: The teacher can independently verify answers using its own tools
- Iterative improvement: Use the teacher to evaluate multiple versions of the student as you improve it
Example scenario: You have a production-ready customer support agent with domain expertise and access to your tools (knowledge base, CRM, documentation search, etc.). You’re developing a new, faster version of this agent. Use the experienced agent as the judge to evaluate whether the new agent meets the same quality standards.
Configuration:
```yaml
name: student-agent-evaluation
description: Experienced agent evaluates student agent performance
dataset: support_questions.csv

target:
  kind: agent
  agent_file: student_agent.af    # New agent being developed
  base_url: http://localhost:8283

graders:
  teacher_evaluation:
    kind: rubric
    agent_file: teacher_agent.af  # Experienced production agent with domain tools
    prompt: |
      You are an experienced customer support agent evaluating a new agent's response.

      Customer question: {input}
      Student agent's answer: {submission}

      Use your available tools to verify the answer is correct and complete. Grade based on:
      1. Factual accuracy (0.5 weight) - Does the answer contain correct information?
      2. Completeness (0.3 weight) - Does it fully address the question?
      3. Tone and professionalism (0.2 weight) - Is it appropriately worded?

      Submit a score from 0.0 to 1.0 using the submit_grade tool.
    extractor: last_assistant

gate:
  metric_key: teacher_evaluation
  op: gte
  value: 0.8   # Student must score 80% or higher
```

Benefits of this approach:
- Leverage existing expertise: Your best agent becomes the standard
- Scalable quality control: Teacher evaluates hundreds of scenarios automatically
- Continuous validation: Run teacher evaluations in CI/CD as you iterate on the student
- Transfer learning: Teacher’s evaluation helps identify where the student needs improvement
### Complete Example

See `examples/letta-agent-rubric-grader/` for a working example with:

- Judge agent with `submit_grade` and `fetch_webpage` tools
- Target agent that fetches webpages and answers questions
- Rubric that instructs the judge to verify answers independently
- Complete suite configuration
## Use Cases

### Quality Assessment

```yaml
graders:
  quality:
    kind: rubric
    prompt_path: quality_rubric.txt
    model: gpt-4o-mini
    extractor: last_assistant
```

quality_rubric.txt:
```text
Evaluate response quality based on:
1. Accuracy of information
2. Completeness of answer
3. Clarity of explanation

Response: {submission}
Ground truth: {ground_truth}

Score from 0.0 to 1.0.
```

### Creativity Evaluation
Section titled “Creativity Evaluation”graders: creativity: kind: rubric prompt: | Rate the creativity and originality of this story.
Story: {submission}
1.0 = Highly creative and original 0.5 = Some creative elements 0.0 = Generic or cliché model: gpt-4o-mini extractor: last_assistantMulti-Criteria Evaluation
Section titled “Multi-Criteria Evaluation”graders: comprehensive: kind: rubric prompt: | Evaluate the response on multiple criteria:
1. Technical Accuracy (40%) 2. Clarity of Explanation (30%) 3. Completeness (20%) 4. Conciseness (10%)
Input: {input} Response: {submission} Expected: {ground_truth}
Provide a weighted score from 0.0 to 1.0. model: gpt-4o extractor: last_assistantCode Quality
Section titled “Code Quality”graders: code_quality: kind: rubric prompt: | Evaluate this code for: - Correctness - Readability - Efficiency - Best practices
Code: {submission}
Score from 0.0 to 1.0. model: gpt-4o extractor: last_assistantTone and Style
Section titled “Tone and Style”graders: professionalism: kind: rubric prompt: | Rate the professionalism and appropriate tone of the response.
Response: {submission}
1.0 = Highly professional 0.5 = Acceptable 0.0 = Unprofessional or inappropriate model: gpt-4o-mini extractor: last_assistantBest Practices
### 1. Clear Scoring Criteria

Provide explicit score ranges and what they mean:

```text
Score:
- 1.0: Perfect response with no issues
- 0.8-0.9: Minor improvements possible
- 0.6-0.7: Some gaps or errors
- 0.4-0.5: Significant problems
- 0.2-0.3: Major issues
- 0.0-0.1: Complete failure
```

### 2. Use Ground Truth When Available
If you have expected answers, include them:

```text
Expected: {ground_truth}
Actual: {submission}

Evaluate how well the actual response matches the expected content.
```

### 3. Be Specific About Criteria

Vague: "Evaluate the quality"
Better: "Evaluate accuracy, completeness, and clarity"
### 4. Use Examples in Rubric

```text
Example of 1.0: "A complete, accurate answer with clear explanation"
Example of 0.5: "Partially correct but missing key details"
Example of 0.0: "Incorrect or irrelevant response"
```

### 5. Calibrate with Test Cases

Run on a small set first to ensure the rubric produces expected scores.
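One lightweight way to do this is to compare judge scores against a small hand-labeled set. Everything below is illustrative; swap the stand-in lambda for a call to your real rubric grader:

```python
from typing import Callable

# Hand-labeled calibration set: (submission, expected score)
CALIBRATION = [
    ("A complete, accurate answer with clear explanation", 1.0),
    ("Partially correct but missing key details", 0.5),
    ("Incorrect or irrelevant response", 0.0),
]

def check_calibration(grade_fn: Callable[[str], float], tolerance: float = 0.2) -> None:
    """Compare judge scores against hand labels and flag large drift."""
    for submission, expected in CALIBRATION:
        got = grade_fn(submission)
        status = "OK" if abs(got - expected) <= tolerance else "RECALIBRATE"
        print(f"{status}: expected {expected:.1f}, got {got:.2f} | {submission[:40]}")

# Example with a stand-in grader
check_calibration(lambda s: 0.9 if "complete" in s else 0.1)
```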
### 6. Consider Model Choice

- gpt-4o-mini: Fast and cost-effective for simple criteria
- gpt-4o: More accurate for complex evaluation
- claude-3-5-sonnet: Alternative perspective (via OpenAI-compatible endpoint)
## Environment Setup

Rubric graders require an OpenAI API key:

```bash
export OPENAI_API_KEY=your-api-key
```

For custom endpoints:

```bash
export OPENAI_BASE_URL=https://your-endpoint.com/v1
```

## Error Handling
If grading fails:
- Score is set to 0.0
- Rationale includes error message
- Metadata includes error details
- Evaluation continues (doesn’t stop the suite)
Common errors:
- API timeout → Check the `timeout` setting
- Invalid API key → Verify `OPENAI_API_KEY`
- Rate limit → Reduce concurrency or add retries
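If you need retry behavior outside the framework (for example, wrapping your own judge calls in a script), a simple exponential backoff helper is one option. This is a sketch; inside suites, prefer the built-in `max_retries`:

```python
import time

def with_retries(fn, attempts: int = 5, base_delay: float = 1.0):
    """Call fn(), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: propagate the error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```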
## Cost Considerations

Rubric graders make API calls for each sample:
- gpt-4o-mini: ~$0.00015 per evaluation (cheap)
- gpt-4o: ~$0.002 per evaluation (more expensive)
For 1000 samples:
- gpt-4o-mini: ~$0.15
- gpt-4o: ~$2.00
Estimate costs before running large evaluations.
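A back-of-the-envelope estimate is easy to script. The per-evaluation rates below mirror the approximate figures above and change over time:

```python
COST_PER_EVAL = {"gpt-4o-mini": 0.00015, "gpt-4o": 0.002}  # approx USD

def estimate_cost(model: str, samples: int, graders: int = 1) -> float:
    # One judge call per sample per rubric grader
    return COST_PER_EVAL[model] * samples * graders

print(f"${estimate_cost('gpt-4o-mini', 1000):.2f}")  # ~$0.15
print(f"${estimate_cost('gpt-4o', 1000):.2f}")       # ~$2.00
```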
## Performance

Rubric graders are slower than tool graders:
- Tool grader: <1ms per sample
- Rubric grader: 500-2000ms per sample (network + LLM)
Use concurrency to speed up:
```bash
letta-evals run suite.yaml --max-concurrent 10
```

## Limitations
Rubric graders:
- Cost: API calls cost money
- Speed: Slower than tool graders
- Consistency: Can vary slightly between runs (use temperature 0.0 for best consistency)
- API dependency: Requires network and API availability
For deterministic, fast evaluation, use Tool Graders.
## Combining Tool and Rubric Graders

Use both in one suite:

```yaml
graders:
  format_check:
    kind: tool
    function: regex_match
    extractor: last_assistant

  quality:
    kind: rubric
    prompt_path: quality.txt
    model: gpt-4o-mini
    extractor: last_assistant

gate:
  metric_key: quality   # Gate on quality, but still check format
  op: gte
  value: 0.7
```

This combines fast deterministic checks with nuanced quality evaluation.
## Next Steps

- Built-in Extractors - Understanding what to extract from trajectories
- Tool Graders - Deterministic evaluation for objective criteria
- Multi-Metric Evaluation - Combining multiple graders
- Custom Graders - Writing custom evaluation logic