Streaming agent responses

Messages from the Letta server can be streamed to the client. If you’re building a UI on the Letta API, enabling streaming allows your UI to update in real time as the agent generates its response to an input message.

When working with agents that execute long-running operations (e.g., complex tool calls, extensive searches, or code execution), you may encounter timeouts with the message routes. See our tips on handling long-running tasks for more info.

Quick Start

Letta supports two streaming modes: step streaming (default) and token streaming.
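The examples in this guide assume you already have a client and an agent. A minimal setup might look like the following; this is a sketch that assumes the letta_client Python SDK and a locally running Letta server, so adjust the base URL (or authenticate with an API token) for your own deployment:

from letta_client import Letta

# Assumption: letta-client SDK pointed at a local Letta server
client = Letta(base_url="http://localhost:8283")

# Any existing agent works; this mirrors the creation call shown later in this guide
agent = client.agents.create(model="openai/gpt-4o-mini")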

To enable streaming, use the /v1/agents/{agent_id}/messages/stream endpoint instead of /messages:

# Step streaming (default) - returns complete messages
stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Hello!"}]
)
for chunk in stream:
    print(chunk)  # Complete message objects

# Token streaming - returns partial chunks for real-time UX
stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Hello!"}],
    stream_tokens=True  # Enable token streaming
)
for chunk in stream:
    print(chunk)  # Partial content chunks

Streaming Modes Comparison

| Aspect            | Step Streaming (default)          | Token Streaming                   |
|-------------------|-----------------------------------|-----------------------------------|
| What you get      | Complete messages after each step | Partial chunks as tokens generate |
| When to use       | Simple implementation             | ChatGPT-like real-time UX         |
| Reassembly needed | No                                | Yes (by message ID)               |
| Message IDs       | Unique per message                | Same ID across chunks             |
| Content format    | Full text in each message         | Incremental text pieces           |
| Enable with       | Default behavior                  | stream_tokens: true               |

Understanding Message Flow

Message Types and Flow Patterns

The messages you receive depend on your agent’s configuration:

With reasoning enabled (default):

  • Simple response: reasoning_message → assistant_message
  • With tool use: reasoning_message → tool_call_message → tool_return_message → reasoning_message → assistant_message

With reasoning disabled (reasoning=false):

  • Simple response: assistant_message
  • With tool use: tool_call_message → tool_return_message → assistant_message

Message Type Reference

  • reasoning_message: Agent’s internal thinking process (only when reasoning=true)
  • assistant_message: The actual response shown to the user
  • tool_call_message: Request to execute a tool
  • tool_return_message: Result from tool execution
  • stop_reason: Indicates end of response (end_turn)
  • usage_statistics: Token usage and step count metrics
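
Putting the flow patterns and the type reference together, a stream consumer can dispatch on message_type. The sketch below uses the step-streaming endpoint from the Quick Start; the reasoning and content fields match the examples in this guide, while the tool-related chunks are printed whole because their exact attribute names may vary by SDK version.

stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Look something up for me"}]
)

for chunk in stream:
    msg_type = getattr(chunk, "message_type", None)
    if msg_type == "reasoning_message":
        print(f"[thinking] {chunk.reasoning}")
    elif msg_type == "tool_call_message":
        print(f"[tool call] {chunk}")      # request to execute a tool
    elif msg_type == "tool_return_message":
        print(f"[tool result] {chunk}")    # result from tool execution
    elif msg_type == "assistant_message":
        print(f"[assistant] {chunk.content}")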

Controlling Reasoning Messages

# With reasoning (default) - includes reasoning_message events
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    # reasoning=True is the default
)

# Without reasoning - no reasoning_message events
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    reasoning=False  # Disable reasoning messages
)

Step Streaming (Default)

Step streaming delivers complete messages after each agent step completes. This is the default behavior when you use the streaming endpoint.

How It Works

  1. Agent processes your request through steps (reasoning, tool calls, generating responses)
  2. After each step completes, you receive a complete LettaMessage via SSE
  3. Each message can be processed immediately without reassembly

Example

stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "What's 2+2?"}]
)

for chunk in stream:
    if hasattr(chunk, 'message_type'):
        if chunk.message_type == 'reasoning_message':
            print(f"Thinking: {chunk.reasoning}")
        elif chunk.message_type == 'assistant_message':
            print(f"Response: {chunk.content}")

Example Output

data: {"id":"msg-123","message_type":"reasoning_message","reasoning":"User is asking a simple math question."}
data: {"id":"msg-456","message_type":"assistant_message","content":"2 + 2 equals 4!"}
data: {"message_type":"stop_reason","stop_reason":"end_turn"}
data: {"message_type":"usage_statistics","completion_tokens":50,"total_tokens":2821}
data: [DONE]
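
The example code above only prints reasoning and assistant messages; the stop_reason and usage_statistics events at the end of the stream can be used to detect completion and track token spend. A small extension of the same loop, with field names taken from the example output above:

for chunk in stream:
    msg_type = getattr(chunk, "message_type", None)
    if msg_type == "assistant_message":
        print(f"Response: {chunk.content}")
    elif msg_type == "stop_reason":
        print(f"Finished: {chunk.stop_reason}")       # e.g. "end_turn"
    elif msg_type == "usage_statistics":
        print(f"Total tokens: {chunk.total_tokens}")  # field shown in the output above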

Token Streaming

Token streaming provides partial content chunks as they’re generated by the LLM, enabling a ChatGPT-like experience where text appears character by character.

How It Works

  1. Set stream_tokens: true in your request
  2. Receive multiple chunks with the same message ID
  3. Each chunk contains a piece of the content
  4. Client must accumulate chunks by ID to rebuild complete messages

Example with Reassembly

# Token streaming with reassembly
message_accumulators = {}

stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream_tokens=True
)

for chunk in stream:
    if hasattr(chunk, 'id') and hasattr(chunk, 'message_type'):
        msg_id = chunk.id
        msg_type = chunk.message_type

        # Initialize accumulator for new messages
        if msg_id not in message_accumulators:
            message_accumulators[msg_id] = {
                'type': msg_type,
                'content': ''
            }

        # Accumulate content and print each new piece as it arrives
        if msg_type == 'reasoning_message':
            message_accumulators[msg_id]['content'] += chunk.reasoning
            print(chunk.reasoning, end='', flush=True)
        elif msg_type == 'assistant_message':
            message_accumulators[msg_id]['content'] += chunk.content
            print(chunk.content, end='', flush=True)

Example Output

# Same ID across chunks of the same message
data: {"id":"msg-abc","message_type":"assistant_message","content":"Why"}
data: {"id":"msg-abc","message_type":"assistant_message","content":" did"}
data: {"id":"msg-abc","message_type":"assistant_message","content":" the"}
data: {"id":"msg-abc","message_type":"assistant_message","content":" scarecrow"}
data: {"id":"msg-abc","message_type":"assistant_message","content":" win"}
# ... more chunks with same ID
data: [DONE]

Implementation Tips

Universal Handling Pattern

The accumulator pattern shown above works for both streaming modes:

  • Step streaming: Each message is complete (single chunk per ID)
  • Token streaming: Multiple chunks per ID need accumulation

This means you can write your client code once to handle both cases.
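
As a sketch of that idea, the accumulator from the token-streaming example can be factored into a mode-agnostic helper. The function name and return shape are illustrative, not part of the SDK:

def accumulate_stream(stream):
    """Collect a message stream (step or token mode) into complete
    messages keyed by message ID. Illustrative helper, not an SDK API."""
    messages = {}
    for chunk in stream:
        if not (hasattr(chunk, "id") and hasattr(chunk, "message_type")):
            continue  # stop_reason / usage_statistics events carry no id
        entry = messages.setdefault(chunk.id, {"type": chunk.message_type, "content": ""})
        if chunk.message_type == "reasoning_message":
            entry["content"] += chunk.reasoning
        elif chunk.message_type == "assistant_message":
            entry["content"] += chunk.content
    return messages

# Works unchanged for both modes: step streams yield one chunk per ID,
# token streams yield many chunks that share an ID.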

SSE Format Notes

All streaming responses follow the Server-Sent Events (SSE) format:

  • Each event starts with data: followed by JSON
  • Stream ends with data: [DONE]
  • Empty lines separate events

Learn more about SSE format here.
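
If you’re not using an SDK, the same stream can be consumed directly over HTTP. The sketch below uses the requests library; the base URL, port, and placeholder agent ID are assumptions to adapt for your deployment, while the request-body field names mirror the SDK parameters shown earlier:

import json
import requests

agent_id = "your-agent-id"  # placeholder
url = f"http://localhost:8283/v1/agents/{agent_id}/messages/stream"
payload = {
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream_tokens": True,
}

with requests.post(url, json=payload, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue  # blank lines separate SSE events
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # end of stream
        event = json.loads(data)
        print(event.get("message_type"), event.get("content", ""))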

Handling Different LLM Providers

If your Letta server connects to multiple LLM providers, some may not support token streaming. Your client code will still work: the server automatically falls back to step streaming when token streaming isn’t available.