Low-latency Agents

Agents optimized for low-latency environments like voice

Low-latency agents optimize for minimal response time by using a constrained context window and aggressive memory management. They’re ideal for real-time applications like voice interfaces where latency matters more than context retention.

Architecture

Low-latency agents use a much smaller context window than standard MemGPT agents, reducing time-to-first-token at the cost of a much shorter conversation history and smaller memory blocks. A sleep-time agent aggressively manages memory to keep only the most relevant information in context.

Key differences from MemGPT v2:

  • Artificially constrained context window for faster response times
  • More aggressive memory management with smaller memory blocks
  • Optimized sleep-time agent tuned for minimal context size
  • Prioritizes speed over comprehensive context retention
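The sleep-time agent handles the actual memory management, but the effect on the message buffer is easy to picture: only the newest messages that fit a fixed token budget stay in context. The sketch below is purely illustrative, not Letta's internal implementation, and it approximates token counts with a simple word count:

from typing import Dict, List

def trim_to_budget(messages: List[Dict[str, str]], max_tokens: int = 2000) -> List[Dict[str, str]]:
    # Keep the newest messages whose combined (approximate) token count
    # fits within max_tokens; older messages fall out of context.
    kept: List[Dict[str, str]] = []
    used = 0
    for message in reversed(messages):  # walk newest-first
        cost = len(message["content"].split())  # crude token estimate
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order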

To learn more about how to use low-latency agents for voice applications, see our Voice Agents guide.

Creating Low-latency Agents

Use the voice_convo_agent agent type to create a low-latency agent. Set enable_sleeptime to true to enable the sleep-time agent, which manages the low-latency agent's memory state in the background. Additionally, set initial_message_sequence to an empty array so the agent starts with a completely empty message buffer.

from letta_client import Letta

client = Letta(token="LETTA_API_KEY")

# create the Letta agent
agent = client.agents.create(
    agent_type="voice_convo_agent",
    memory_blocks=[
        {"value": "Name: ?", "label": "human"},
        {"value": "You are a helpful assistant.", "label": "persona"},
    ],
    model="openai/gpt-4o-mini",  # Use 4o-mini for speed
    embedding="openai/text-embedding-3-small",
    enable_sleeptime=True,
    initial_message_sequence=[],
)
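
Once created, the agent is messaged like any other Letta agent; the low-latency behavior comes entirely from its configuration. The snippet below assumes the standard messages endpoint of the Python SDK (client.agents.messages.create) and a plain role/content message dict; check your SDK version if the signature differs.

# Send a message to the low-latency agent (assumes the standard
# client.agents.messages.create endpoint of the letta_client SDK).
response = client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Hi, my name is Sarah."}],
)
for message in response.messages:
    print(message)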