To use Letta with vLLM, set the environment variable VLLM_API_BASE to point to your vLLM ChatCompletions server.

Setting up vLLM

  1. Download + install vLLM
  2. Launch a vLLM OpenAI-compatible API server, following the official vLLM documentation

For example, if we want to use the model dolphin-2.2.1-mistral-7b from HuggingFace, we would run:

python -m vllm.entrypoints.openai.api_server \
    --model ehartford/dolphin-2.2.1-mistral-7b

vLLM will automatically download the model (if it’s not already downloaded) and store it in your HuggingFace cache directory.
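
Before pointing Letta at the server, it can help to confirm vLLM is up and serving the model. The sketch below is not part of the Letta setup itself; it uses the standard openai Python client against vLLM's OpenAI-compatible endpoints, and assumes the default port 8000 and the model launched above.

# Minimal sanity check against a local vLLM server (assumes the default
# port 8000 and the model from the launch command above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the models the vLLM server is exposing.
for model in client.models.list().data:
    print(model.id)

# Issue a small test completion against the served model.
completion = client.completions.create(
    model="ehartford/dolphin-2.2.1-mistral-7b",
    prompt="Hello, my name is",
    max_tokens=16,
)
print(completion.choices[0].text)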

Enabling vLLM as a provider

To enable the vLLM provider, you must set the VLLM_API_BASE environment variable. When this is set, Letta will use available LLM and embedding models running on vLLM.
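
As a quick check that the provider was picked up, you can ask a running Letta server which models it sees. The sketch below is an assumption-laden example: it assumes the server's default port (8283) and a model-listing route at /v1/models/, so adjust the path to match the API reference for your Letta version.

import requests

LETTA_BASE = "http://localhost:8283"  # default Letta server address (assumption)

# Assumed route: a model-listing endpoint under /v1/models/; check your
# Letta version's API reference if the path differs.
response = requests.get(f"{LETTA_BASE}/v1/models/")
response.raise_for_status()
print(response.json())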

Using the Docker server with vLLM

Since vLLM runs on the host network, the Letta container must connect to the vLLM server via host.docker.internal instead of localhost. You'll also want to make sure port 8000 (the default vLLM port) is open on your host machine.

# replace `~/.letta/.persist/pgdata` with wherever you want to store your agent data
docker run \
  -v ~/.letta/.persist/pgdata:/var/lib/postgresql/data \
  -p 8283:8283 \
  -p 8000:8000 \
  -e VLLM_API_BASE="http://host.docker.internal:8000" \
  letta/letta:latest

Using letta run and letta server with vLLM

To chat with an agent, run:

export VLLM_API_BASE="http://localhost:8000"
letta run

To run the Letta server, run:

export VLLM_API_BASE="http://localhost:8000"
letta server

To select the model used by the server, use the dropdown in the ADE or specify an LLMConfig object in the Python SDK.
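
For the SDK route, a minimal sketch of pointing an agent at the vLLM-served model is shown below. The client methods and LLMConfig field names are based on recent Letta releases and may differ in your installed version, so treat them as illustrative rather than definitive.

from letta import create_client
from letta.schemas.llm_config import LLMConfig

# Connect to a running Letta server (default port 8283).
client = create_client(base_url="http://localhost:8283")

# Describe the model being served by vLLM; field names follow the
# LLMConfig schema but should be checked against your Letta version.
vllm_config = LLMConfig(
    model="ehartford/dolphin-2.2.1-mistral-7b",  # must match the model vLLM is serving
    model_endpoint_type="vllm",
    model_endpoint="http://localhost:8000",
    context_window=8192,
)

agent = client.create_agent(name="vllm-agent", llm_config=vllm_config)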
