Rubric Graders
Use rubric-based grading with predefined criteria to evaluate agent responses consistently.
Rubric graders use language models to evaluate submissions based on custom criteria. They’re ideal for subjective, nuanced evaluation.
Basic Configuration
```yaml
graders:
  quality:
    kind: rubric
    prompt_path: quality_rubric.txt  # Evaluation criteria
    model: gpt-4o-mini               # Judge model
    temperature: 0.0                 # Deterministic
    extractor: last_assistant        # What to evaluate
```
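The `extractor` field selects which part of the agent's transcript becomes the submission that gets graded. As a hypothetical illustration (this helper is not part of the library), `last_assistant` behaves roughly like:

```python
# Hypothetical sketch of what an extractor such as `last_assistant` selects:
# the text of the final assistant message in the run transcript.
def extract_last_assistant(messages: list[dict]) -> str:
    for message in reversed(messages):
        if message.get("role") == "assistant" and message.get("content"):
            return message["content"]
    return ""

transcript = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
submission = extract_last_assistant(transcript)  # "The capital of France is Paris."
```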
Rubric Prompt Format

Your rubric file should describe the evaluation criteria. It can use these placeholders:
- `{input}`: The original input from the dataset
- `{submission}`: The extracted agent response
- `{ground_truth}`: Ground truth from the dataset (if available)
Example quality_rubric.txt:
```
Evaluate the response for:
1. Accuracy: Does it correctly answer the question?
2. Completeness: Is the answer thorough?
3. Clarity: Is it well-explained?

Input: {input}
Expected: {ground_truth}
Response: {submission}

Score from 0.0 to 1.0 where:
- 1.0: Perfect response
- 0.75: Good with minor issues
- 0.5: Acceptable but incomplete
- 0.25: Poor quality
- 0.0: Completely wrong
```
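At grading time the placeholders are filled with values from the dataset row and the agent run. Conceptually this amounts to a simple template substitution; the sketch below is hypothetical (`render_rubric` is not part of the library) and only illustrates the idea:

```python
from pathlib import Path

def render_rubric(prompt_path: str, input_text: str, submission: str,
                  ground_truth: str = "") -> str:
    """Hypothetical helper: fill the rubric template's placeholders."""
    template = Path(prompt_path).read_text()
    return template.format(
        input=input_text,
        submission=submission,
        ground_truth=ground_truth,
    )

# Example: render the rubric above for one dataset row.
prompt = render_rubric(
    "quality_rubric.txt",
    input_text="What is the capital of France?",
    submission="The capital of France is Paris.",
    ground_truth="Paris",
)
```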
Model Configuration

```yaml
graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt
    model: gpt-4o-mini   # Judge model
    temperature: 0.0     # Deterministic
    provider: openai     # LLM provider
    max_retries: 5       # API retry attempts
    timeout: 120.0       # Request timeout
```
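These settings map onto an ordinary chat-completion call to the judge model. As a rough sketch only (using the OpenAI Python SDK; the grader's actual internals and score parsing may differ):

```python
from openai import OpenAI

# Sketch of a direct judge call with the settings above. Assumption: the
# judge is instructed to reply with a bare number between 0.0 and 1.0.
client = OpenAI(max_retries=5, timeout=120.0)

def judge(rendered_rubric: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.0,  # deterministic judging
        messages=[{"role": "user", "content": rendered_rubric}],
    )
    return float(response.choices[0].message.content.strip())
```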
Agent-as-Judge

Use a Letta agent as the judge instead of a direct LLM API call:
```yaml
graders:
  agent_judge:
    kind: rubric
    agent_file: judge.af      # Judge agent with submit_grade tool
    prompt_path: rubric.txt   # Evaluation criteria
    extractor: last_assistant
```

Requirements: The judge agent must have a tool with the signature `submit_grade(score: float, rationale: str)`.
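A minimal version of that tool might look like the sketch below. The signature comes from the requirement above; the body is illustrative only, since in practice the eval harness reads the score from this call:

```python
def submit_grade(score: float, rationale: str) -> str:
    """Submit the judge agent's grade for the current submission.

    Args:
        score: Grade between 0.0 and 1.0.
        rationale: Short explanation of the grade.
    """
    # Illustrative body: the harness records the score from this call;
    # the return value is only echoed back to the judge agent.
    return f"Recorded grade {score:.2f}: {rationale}"
```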
Next Steps
- Tool Graders - Deterministic grading functions
- Multi-Metric - Combine multiple graders
- Custom Graders - Write your own grading logic