To use Letta with vLLM, set the environment variable VLLM_API_BASE to point to your vLLM ChatCompletions server.

Setting up vLLM

  1. Download + install vLLM
  2. Launch a vLLM OpenAI-compatible API server, following the official vLLM documentation

For example, if we want to use the model dolphin-2.2.1-mistral-7b from HuggingFace, we would run:

python -m vllm.entrypoints.openai.api_server \
    --model ehartford/dolphin-2.2.1-mistral-7b

vLLM will automatically download the model (if it’s not already downloaded) and store it in your HuggingFace cache directory.
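
Before pointing Letta at the server, it can help to confirm vLLM is up and serving the model. The sketch below is not part of the Letta setup itself; it uses the standard openai Python client against vLLM's OpenAI-compatible endpoints, and assumes the default port 8000 and the model launched above.

# Minimal sanity check against a local vLLM server (assumes the default
# port 8000 and the model from the launch command above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the models the vLLM server is exposing.
for model in client.models.list().data:
    print(model.id)

# Issue a small test completion against the served model.
completion = client.completions.create(
    model="ehartford/dolphin-2.2.1-mistral-7b",
    prompt="Hello, my name is",
    max_tokens=16,
)
print(completion.choices[0].text)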

Enabling vLLM as a provider

To enable the vLLM provider, you must set the VLLM_API_BASE environment variable. When this is set, Letta will use available LLM and embedding models running on vLLM.
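
As a quick check that the provider was picked up, you can ask a running Letta server which models it sees. The sketch below is an assumption-laden example: it assumes the server's default port (8283) and a model-listing route at /v1/models/, so adjust the path to match the API reference for your Letta version.

import requests

LETTA_BASE = "http://localhost:8283"  # default Letta server address (assumption)

# Assumed route: a model-listing endpoint under /v1/models/; check your
# Letta version's API reference if the path differs.
response = requests.get(f"{LETTA_BASE}/v1/models/")
response.raise_for_status()
print(response.json())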

Using the Docker server with vLLM

Since vLLM runs on the host network, the Letta container must connect to the vLLM server via host.docker.internal instead of localhost. You'll also want to make sure port 8000 (the default vLLM port) is open on your host machine.

# replace `~/.letta/.persist/pgdata` with wherever you want to store your agent data
docker run \
  -v ~/.letta/.persist/pgdata:/var/lib/postgresql/data \
  -p 8283:8283 \
  -p 8000:8000 \
  -e VLLM_API_BASE="http://host.docker.internal:8000" \
  letta/letta:latest

Using letta run and letta server with vLLM

To chat with an agent, run:

export VLLM_API_BASE="http://localhost:8000"
letta run

To run the Letta server, run:

export VLLM_API_BASE="http://localhost:8000"
letta server

To select the model used by the server, use the dropdown in the ADE or specify an LLMConfig object in the Python SDK.
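
For the SDK route, a minimal sketch of pointing an agent at the vLLM-served model is shown below. The client methods and LLMConfig field names are based on recent Letta releases and may differ in your installed version, so treat them as illustrative rather than definitive.

from letta import create_client
from letta.schemas.llm_config import LLMConfig

# Connect to a running Letta server (default port 8283).
client = create_client(base_url="http://localhost:8283")

# Describe the model being served by vLLM; field names follow the
# LLMConfig schema but should be checked against your Letta version.
vllm_config = LLMConfig(
    model="ehartford/dolphin-2.2.1-mistral-7b",  # must match the model vLLM is serving
    model_endpoint_type="vllm",
    model_endpoint="http://localhost:8000",
    context_window=8192,
)

agent = client.create_agent(name="vllm-agent", llm_config=vllm_config)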
