Rubric Graders

Rubric graders use language models to evaluate submissions based on custom criteria. They’re ideal for subjective, nuanced evaluation.

A rubric grader provides the judge LLM with a prompt describing the evaluation criteria; the model then returns a structured JSON response containing a score and a rationale.
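For instance, the judge's structured output might look roughly like this (the score and rationale fields follow from the description above; the exact shape is illustrative):

{
  "score": 0.75,
  "rationale": "Accurate and clear, but omits one detail requested in the question."
}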

Basic Configuration

graders:
  quality:
    kind: rubric
    prompt_path: quality_rubric.txt   # Evaluation criteria
    model: gpt-4o-mini                # Judge model
    temperature: 0.0                  # Deterministic
    extractor: last_assistant         # What to evaluate

Rubric Prompt Format

Your rubric file should describe the evaluation criteria. Use placeholders:

  • {input}: The original input from the dataset
  • {submission}: The extracted agent response
  • {ground_truth}: Ground truth from dataset (if available)

Example quality_rubric.txt:

Evaluate the response for:
1. Accuracy: Does it correctly answer the question?
2. Completeness: Is the answer thorough?
3. Clarity: Is it well-explained?

Input: {input}
Expected: {ground_truth}
Response: {submission}

Score from 0.0 to 1.0 where:
- 1.0: Perfect response
- 0.75: Good with minor issues
- 0.5: Acceptable but incomplete
- 0.25: Poor quality
- 0.0: Completely wrong
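As a rough illustration of how these placeholders are filled before the prompt is sent to the judge, consider the sketch below. The harness performs this substitution internally; the values and file name here are purely illustrative.

from pathlib import Path

# Load the rubric file referenced by prompt_path.
rubric_template = Path("quality_rubric.txt").read_text()

# Fill the placeholders with the dataset input, the ground truth (if any),
# and the extracted agent response.
judge_prompt = rubric_template.format(
    input="What is the capital of France?",
    ground_truth="Paris",
    submission="The capital of France is Paris.",
)

print(judge_prompt)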

Model Configuration

graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt
    model: gpt-4o-mini       # Judge model
    temperature: 0.0         # Deterministic
    provider: openai         # LLM provider
    max_retries: 5           # API retry attempts
    timeout: 120.0           # Request timeout

Agent-as-Judge

Use a Letta agent as the judge instead of a direct LLM API call:

graders:
  agent_judge:
    kind: rubric
    agent_file: judge.af       # Judge agent with submit_grade tool
    prompt_path: rubric.txt    # Evaluation criteria
    extractor: last_assistant

Requirements: The judge agent must have a tool with signature submit_grade(score: float, rationale: str).
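A minimal sketch of such a tool is shown below, assuming the judge agent's tools are plain Python functions with type hints and a docstring. Only the signature is required by the grader; the body and return value here are illustrative.

def submit_grade(score: float, rationale: str) -> str:
    """Submit the final grade for the evaluated submission.

    Args:
        score: Score between 0.0 and 1.0.
        rationale: Short explanation of the score.
    """
    # Illustrative acknowledgement; adapt to however your judge agent
    # is expected to record or report its grade.
    return f"Grade recorded: {score:.2f} ({rationale})"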

Next Steps