Streaming agent responses

Messages from the Letta server can be streamed to the client. If you’re building a UI on the Letta API, enabling streaming allows your UI to update in real time as the agent generates a response to an input message.

There are two kinds of streaming you can enable: streaming agent steps and streaming tokens. To enable either mode, use the /v1/agents/{agent_id}/messages/stream API route instead of the /v1/agents/{agent_id}/messages route.

Streaming agent steps

When you send a message to the Letta server, the agent may run multiple steps while generating a response. For example, an agent may run a search query, then use the results of that query to generate a response.

When you use the /messages/stream route, stream_steps is enabled by default, and the response to the POST request will stream back as server-sent events (SSE):

curl --request POST \
  --url http://localhost:8283/v1/agents/$AGENT_ID/messages/stream \
  --header 'Content-Type: application/json' \
  --data '{
  "messages": [
    {
      "role": "user",
      "text": "hows it going????"
    }
  ]
}'
1data: {"id":"...","date":"...","message_type":"reasoning_message","reasoning":"User keeps asking the same question; maybe it's part of their style or humor. I\u2019ll respond warmly and play along."}
2
3data: {"id":"...","date":"...","message_type":"assistant_message","assistant_message":"Hey! It\u2019s going well! Still here, ready to chat. How about you? Anything exciting happening?"}
4
5data: {"message_type":"usage_statistics","completion_tokens":65,"prompt_tokens":2329,"total_tokens":2394,"step_count":1}
6
7data: [DONE]
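As a rough illustration, here is a minimal Python sketch of consuming this stream using the requests library. It assumes the same endpoint and payload as the curl example above, and the message_type values match the events shown in the sample output; it is not part of the Letta SDK.

import json
import os

import requests  # assumes the requests package is installed

AGENT_ID = os.environ["AGENT_ID"]
url = f"http://localhost:8283/v1/agents/{AGENT_ID}/messages/stream"
payload = {"messages": [{"role": "user", "text": "hows it going????"}]}

# Stream the POST response and handle each server-sent event as it arrives
with requests.post(url, json=payload, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue  # skip blank separator lines between events
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # end-of-stream marker
        event = json.loads(data)
        if event.get("message_type") == "reasoning_message":
            print("reasoning:", event["reasoning"])
        elif event.get("message_type") == "assistant_message":
            print("assistant:", event["assistant_message"])
        elif event.get("message_type") == "usage_statistics":
            print("usage:", event)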

Streaming tokens

You can also stream chunks of tokens from the agent as they are generated by the underlying LLM process by setting stream_tokens to true in your API request:

curl --request POST \
  --url http://localhost:8283/v1/agents/$AGENT_ID/messages/stream \
  --header 'Content-Type: application/json' \
  --data '{
  "messages": [
    {
      "role": "user",
      "text": "hows it going????"
    }
  ],
  "stream_tokens": true
}'

With token streaming enabled, the response will look very similar to the prior example (agent steps streaming), but instead of receiving complete messages, the client receives multiple messages, each containing a chunk of the response. The client is responsible for reassembling the response from the chunks (see the reassembly sketch after the example output below). We’ve omitted most of the chunks for brevity:

1data: {"id":"...","date":"...","message_type":"reasoning_message","reasoning":"It's"}
2
3data: {"id":"...","date":"...","message_type":"reasoning_message","reasoning":" interesting"}
4
5... chunks ommited
6
7data: {"id":"...","date":"...","message_type":"reasoning_message","reasoning":"!"}
8
9data: {"id":"...","date":"...","message_type":"assistant_message","assistant_message":"Well"}
10
11... chunks ommited
12
13data: {"id":"...","date":"...","message_type":"assistant_message","assistant_message":"."}
14
15data: {"message_type":"usage_statistics","completion_tokens":50,"prompt_tokens":2771,"total_tokens":2821,"step_count":1}
16
17data: [DONE]
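One way to reassemble the chunks client-side is to keep a buffer keyed by message id and type and append each chunk’s text as it arrives. The sketch below is illustrative only (it is not part of the Letta SDK) and assumes events have already been parsed out of the SSE stream, as in the earlier Python example.

from collections import defaultdict

# Buffers keyed by (message id, message type); chunks are appended in arrival order
buffers: dict = defaultdict(str)

def handle_event(event: dict) -> None:
    message_type = event.get("message_type")
    if message_type == "reasoning_message":
        buffers[(event["id"], message_type)] += event["reasoning"]
    elif message_type == "assistant_message":
        buffers[(event["id"], message_type)] += event["assistant_message"]
    # usage_statistics and other event types can be handled separately

Feed each parsed event into handle_event from the streaming loop; once [DONE] arrives, buffers holds the complete reasoning and assistant messages.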

Tips on handling streaming in your client code

The data structure for token streaming is the same as for agent steps streaming (LettaMessage); the only difference is that instead of returning complete messages, the Letta server returns multiple messages, each carrying a chunk of the response. Because the data format is the same, frontend code written to handle token streaming will also work for agent steps streaming.

For example, if the Letta server is connected to multiple LLM backend providers and only a subset of them support LLM token streaming, you can use the same frontend code (interacting with the Letta API) to handle both streaming and non-streaming providers. If you send a message to an agent with streaming enabled (stream_tokens set to true), the server will stream back LettaMessage objects containing chunks if the selected LLM provider supports token streaming, and LettaMessage objects containing complete strings if it does not.
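To make this concrete with the handle_event sketch above (the ids and text here are made up for illustration): appending works the same whether the text arrives as many chunks or as one complete string.

# Provider with token streaming: several chunked events for the same message id
handle_event({"id": "msg-1", "message_type": "assistant_message", "assistant_message": "Well"})
handle_event({"id": "msg-1", "message_type": "assistant_message", "assistant_message": ", hello!"})

# Provider without token streaming: one event carrying the complete string
handle_event({"id": "msg-2", "message_type": "assistant_message", "assistant_message": "Well, hello!"})

# Both message ids end up with the same reassembled text in their buffers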
