Multi-Metric Evaluation

Evaluate multiple aspects of agent performance in a single evaluation suite.

Multi-metric evaluation allows you to define multiple graders, each measuring a different dimension of your agent’s behavior.

Why Multiple Metrics?

Agents are complex systems. You might want to evaluate:

  • Correctness: Does the answer match the expected output?
  • Quality: Is the explanation clear and complete?
  • Tool usage: Does the agent call the right tools with correct arguments?
  • Memory: Does the agent correctly update its memory blocks?
  • Format: Does the output follow required formatting rules?

Configuration

graders:
  accuracy:                 # Check if answer is correct
    kind: tool
    function: exact_match
    extractor: last_assistant

  completeness:             # LLM judges response quality
    kind: rubric
    prompt_path: rubrics/completeness.txt
    model: gpt-4o-mini
    extractor: last_assistant

  tool_usage:               # Verify correct tool was called
    kind: tool
    function: contains
    extractor: tool_arguments
    extractor_config:
      tool_name: search
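
The tool graders above reference grading functions by name. As a rough mental model, exact_match and contains compare the extracted text against the expected output. The Python sketch below is illustrative only: the signatures and return conventions are assumptions (extracted text in, expected value in, score between 0 and 1 out), and the actual functions in your evaluation framework may differ.

# Hypothetical sketch of tool-grader functions, assuming each receives the
# extracted text and the expected value and returns a score in [0, 1].
def exact_match(output: str, expected: str) -> float:
    """Score 1.0 only when the extracted answer equals the expected answer."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def contains(output: str, expected: str) -> float:
    """Score 1.0 when the expected value appears anywhere in the extracted text,
    e.g. an expected query inside the arguments of the `search` tool call."""
    return 1.0 if expected in output else 0.0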

Gating on One Metric

The gate can check any of these metrics:

gate:
  metric_key: accuracy    # Gate on accuracy (others still computed)
  op: gte
  value: 0.9

Results will include scores for all graders, even if you only gate on one.
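
Conceptually, the gate just compares the aggregate score of the chosen metric against the configured threshold using the configured operator. The sketch below is a hypothetical illustration of that check, not the framework's actual implementation; only the gte operator appears in the example above, and the other operators shown are assumptions.

import operator

# Hypothetical illustration of gate semantics. Field names mirror the YAML
# above; only "gte" is shown in the docs, the other operators are assumed.
OPS = {
    "gte": operator.ge,
    "gt": operator.gt,
    "lte": operator.le,
    "lt": operator.lt,
    "eq": operator.eq,
}

def gate_passes(scores: dict[str, float], metric_key: str, op: str, value: float) -> bool:
    """Return True when the gated metric's aggregate score satisfies the threshold."""
    return OPS[op](scores[metric_key], value)

# All grader scores are still computed and reported; only `accuracy` decides pass/fail.
print(gate_passes({"accuracy": 0.92, "completeness": 0.71, "tool_usage": 1.0},
                  "accuracy", "gte", 0.9))  # True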

Next Steps