Targets

A target is the agent you’re evaluating. In Letta Evals, the target configuration determines how agents are created, accessed, and tested.

Quick overview:

  • Three ways to specify agents: agent file (.af), existing agent ID, or programmatic creation script
  • Critical distinction: agent_file/agent_script create fresh agents per sample (isolated tests), while agent_id uses one agent for all samples (stateful conversation)
  • Multi-model support: Test the same agent configuration across different LLM models
  • Flexible connection: Connect to local Letta servers or Letta Cloud

When to use each approach:

  • agent_file - Pre-configured agents saved as .af files (most common)
  • agent_id - Testing existing agents or multi-turn conversations with state
  • agent_script - Dynamic agent creation with per-sample customization

The target configuration specifies how to create or access the agent for evaluation.

Target Configuration

All targets have a kind field (currently only agent is supported):

target:
  kind: agent  # Currently only "agent" is supported
  # ... agent-specific configuration

Agent Sources

You must specify exactly ONE of these:

agent_file

Path to a .af (Agent File) to upload:

target:
  kind: agent
  agent_file: path/to/agent.af     # Path to .af file
  base_url: http://localhost:8283  # Letta server URL

The agent file is uploaded to the Letta server, and a fresh agent is created from it for each sample in the evaluation.

agent_id

ID of an existing agent on the server:

target:
  kind: agent
  agent_id: agent-123-abc          # ID of existing agent
  base_url: http://localhost:8283  # Letta server URL

Modifies agent in-place: Using agent_id will modify your agent’s state, memory, and message history during evaluation. The same agent instance is used for all samples, processing them sequentially. Do not use production agents or agents you don’t want to modify. Use agent_file or agent_script for reproducible, isolated testing.

agent_script

Path to a Python script with an agent factory function for programmatic agent creation:

target:
  kind: agent
  agent_script: create_agent.py:create_inventory_agent  # script.py:function_name
  base_url: http://localhost:8283                       # Letta server URL

Format: path/to/script.py:function_name

The function must be decorated with @agent_factory and be an async function with the signature (client: AsyncLetta, sample: Sample) -> str, returning the ID of the created agent:

from letta_client import AsyncLetta, CreateBlock
from letta_evals.decorators import agent_factory
from letta_evals.models import Sample

@agent_factory
async def create_inventory_agent(client: AsyncLetta, sample: Sample) -> str:
    """Create and return agent ID for this sample."""
    # Access custom arguments from the dataset
    item = sample.agent_args.get("item", {})

    # Create agent with sample-specific configuration
    agent = await client.agents.create(
        name="inventory-assistant",
        memory_blocks=[
            CreateBlock(
                label="item_context",
                value=f"Item: {item.get('name', 'Unknown')}",
            )
        ],
        agent_type="letta_v1_agent",
        model="openai/gpt-4.1-mini",
        embedding="openai/text-embedding-3-small",
    )

    return agent.id

Key features:

  • Creates a fresh agent for each sample
  • Can customize agents using sample.agent_args from the dataset (see the example after these lists)
  • Allows testing agent creation logic itself
  • Useful when you don’t have pre-saved agent files

When to use:

  • Testing agent creation workflows
  • Dynamic per-sample agent configuration
  • Agents that need sample-specific memory or tools
  • Programmatic agent testing
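
For illustration, here is roughly what a dataset row feeding sample.agent_args might look like, assuming a JSON-lines dataset. Only agent_args is documented above; treat the other field names as hypothetical placeholders for your actual dataset schema:

{"input": "How many Widget A units are in stock?", "agent_args": {"item": {"name": "Widget A", "sku": "W-1001"}}}

The factory above would then read sample.agent_args.get("item", {}) and seed the item_context memory block with "Item: Widget A".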

Connection Configuration

base_url

Letta server URL:

target:
  base_url: http://localhost:8283  # Local Letta server
  # or
  base_url: https://api.letta.com  # Letta Cloud

Default: http://localhost:8283

api_key

API key for authentication (required for Letta Cloud):

target:
  api_key: your-api-key-here  # Required for Letta Cloud

Or set via environment variable:

export LETTA_API_KEY=your-api-key-here

project_id

Letta project ID (for Letta Cloud):

target:
  project_id: proj_abc123  # Letta Cloud project

Or set via environment variable:

export LETTA_PROJECT_ID=proj_abc123

timeout

Request timeout in seconds:

target:
  timeout: 300.0  # Request timeout (5 minutes)

Default: 300 seconds

Multi-Model Evaluation

Test the same agent across different models:

model_configs

List of model configuration names from JSON files:

target:
  kind: agent
  agent_file: agent.af
  model_configs: [gpt-4o-mini, claude-3-5-sonnet]  # Test with both models

The evaluation will run once for each model config. Model configs are JSON files in letta_evals/llm_model_configs/.
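
As a rough sketch, a model config such as gpt-4o-mini.json contains something like the following. The exact fields follow Letta's LLM config and may differ between versions, so check the files shipped in letta_evals/llm_model_configs/ rather than relying on this example:

{
  "model": "gpt-4o-mini",
  "model_endpoint_type": "openai",
  "model_endpoint": "https://api.openai.com/v1",
  "context_window": 128000
}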

model_handles

List of model handles (cloud-compatible identifiers):

target:
  kind: agent
  agent_file: agent.af
  model_handles: ["openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"]  # Cloud model identifiers

Use this for Letta Cloud deployments.

Note: You cannot specify both model_configs and model_handles.

Complete Examples

Local Development

target:
  kind: agent
  agent_file: ./agents/my_agent.af  # Pre-configured agent
  base_url: http://localhost:8283   # Local server

Letta Cloud

target:
  kind: agent
  agent_id: agent-cloud-123        # Existing cloud agent
  base_url: https://api.letta.com  # Letta Cloud
  api_key: ${LETTA_API_KEY}        # From environment variable
  project_id: proj_abc             # Your project ID

Multi-Model Testing

target:
  kind: agent
  agent_file: agent.af             # Same agent configuration
  base_url: http://localhost:8283  # Local server
  model_configs: [gpt-4o-mini, gpt-4o, claude-3-5-sonnet]  # Test 3 models

Results will include per-model metrics:

Model: gpt-4o-mini - Avg: 0.85, Pass: 85.0%
Model: gpt-4o - Avg: 0.92, Pass: 92.0%
Model: claude-3-5-sonnet - Avg: 0.88, Pass: 88.0%

Programmatic Agent Creation

target:
  kind: agent
  agent_script: setup.py:create_custom_agent  # Programmatic creation
  base_url: http://localhost:8283             # Local server

Environment Variable Precedence

Configuration values are resolved in this order (highest priority first):

  1. CLI arguments (--api-key, --base-url, --project-id)
  2. Suite YAML configuration
  3. Environment variables (LETTA_API_KEY, LETTA_BASE_URL, LETTA_PROJECT_ID)
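
For example, assuming the runner is invoked as letta-evals run (substitute your installation's actual CLI entry point), a key passed on the command line wins over both the suite YAML and the environment:

export LETTA_API_KEY=env-key                   # lowest priority
letta-evals run suite.yaml --api-key cli-key   # highest priority: overrides YAML and the env var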

Agent Lifecycle and Testing Behavior

The way your agent is specified fundamentally changes how the evaluation runs:

With agent_file or agent_script: Independent Testing

Agent lifecycle:

  1. A fresh agent instance is created for each sample
  2. Agent processes the sample input(s)
  3. Agent remains on the server after the sample completes

Testing behavior: Each sample is an independent, isolated test. Agent state (memory, message history) does not carry over between samples. This enables parallel execution and ensures reproducible results.

Use cases:

  • Testing how the agent responds to various independent inputs
  • Ensuring consistent behavior across different scenarios
  • Regression testing where each case should be isolated
  • Evaluating agent responses without prior context

Example: If you have 10 test cases, 10 separate agent instances will be created (one per test case), and they can run in parallel.

With agent_id: Sequential Script Testing

Agent lifecycle:

  1. The same agent instance is used for all samples
  2. Agent processes each sample in sequence
  3. Agent state persists throughout the entire evaluation

Testing behavior: The dataset becomes a conversation script where each sample builds on previous ones. Agent memory and message history accumulate, and earlier interactions affect later responses. Samples must execute sequentially.

Use cases:

  • Testing multi-turn conversations with context
  • Evaluating how agent memory evolves over time
  • Simulating a single user session with multiple interactions
  • Testing scenarios where context should accumulate

Example: If you have 10 test cases, they all run against the same agent instance in order, with state carrying over between each test.
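
For illustration only, a conversation-script dataset might look like the sketch below (the input field name is a hypothetical placeholder; match it to your actual dataset schema). Each later row only makes sense because the same agent remembers the earlier turns:

{"input": "Hi, I'd like to reserve a table for two tomorrow at 7pm."}
{"input": "Actually, make it three people."}
{"input": "Can you repeat the time I asked for?"}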

Critical Differences

Aspect          | agent_file / agent_script | agent_id
----------------|---------------------------|---------------------------
Agent instances | New agent per sample      | Same agent for all samples
State isolation | Fully isolated            | State carries over
Execution       | Can run in parallel       | Must run sequentially
Memory          | Fresh for each sample     | Accumulates across samples
Use case        | Independent test cases    | Conversation scripts
Reproducibility | Highly reproducible       | Depends on execution order

Best practice: Use agent_file or agent_script for most evaluations to ensure reproducible, isolated tests. Use agent_id only when you specifically need to test how agent state evolves across multiple interactions.

Validation

The runner validates:

  • Exactly one of agent_file, agent_id, or agent_script is specified
  • Agent files have .af extension
  • Agent script paths are valid

Next Steps