To use Letta with vLLM, set the environment variable VLLM_API_BASE to point to your vLLM OpenAI-compatible ChatCompletions server.

Setting up vLLM

  1. Download + install vLLM (see the install command after this list)
  2. Launch a vLLM OpenAI-compatible API server following the official vLLM documentation
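vLLM is published on PyPI, so a minimal install (assuming a Python environment with a supported GPU/CUDA setup; see the vLLM docs for platform-specific options) looks like:

pip install vllm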

For example, if we want to use the model dolphin-2.2.1-mistral-7b from HuggingFace, we would run:

python -m vllm.entrypoints.openai.api_server \
--model ehartford/dolphin-2.2.1-mistral-7b

vLLM will automatically download the model (if it’s not already downloaded) and store it in your HuggingFace cache directory.
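Before pointing Letta at the server, you can sanity-check that it is up by querying its OpenAI-compatible model listing endpoint. This assumes the server is running locally on vLLM's default port (8000); adjust the host and port to match your deployment:

curl http://localhost:8000/v1/models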

Enabling vLLM as a provider

To enable the vLLM provider, you must set the VLLM_API_BASE environment variable. When this is set, Letta will use the LLM and embedding models served by your vLLM instance.
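For example, if the vLLM server from the previous step is running locally on its default port (8000), you might set (adjust the host and port to match how you launched the server):

export VLLM_API_BASE="http://localhost:8000"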