vLLM
To use vLLM with Letta, set VLLM_API_BASE to point to your vLLM ChatCompletions server.
Setting up vLLM
- Download + install vLLM
- Launch a vLLM OpenAI-compatible API server using the official vLLM documentation
For example, if we want to use the model dolphin-2.2.1-mistral-7b from HuggingFace, we would run:
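A minimal sketch using vLLM's OpenAI-compatible server entrypoint (the `cognitivecomputations/` HuggingFace org prefix is an assumption; substitute the repo you actually use):

```sh
# Start an OpenAI-compatible API server on vLLM's default port (8000).
# The org prefix below is an assumption; point --model at the HuggingFace repo you want.
python -m vllm.entrypoints.openai.api_server \
    --model cognitivecomputations/dolphin-2.2.1-mistral-7b
```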
vLLM will automatically download the model (if it’s not already downloaded) and store it in your HuggingFace cache directory.
Enabling vLLM as a provider
To enable the vLLM provider, you must set the VLLM_API_BASE environment variable. When this is set, Letta will use available LLM and embedding models running on vLLM.
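For example, assuming the vLLM server from the previous step is listening on its default port (8000) on the same machine:

```sh
# Point Letta at the local vLLM OpenAI-compatible server.
export VLLM_API_BASE="http://localhost:8000"
```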
Using the docker run server with vLLM
macOS/Windows:
Since vLLM is running on the host network, you will need to use host.docker.internal to connect to the vLLM server instead of localhost.
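A sketch of the docker run invocation under those assumptions (the image name, data volume path, and 8283 server port follow the standard Letta Docker setup; vLLM is assumed to be on its default port 8000):

```sh
# Persist agent data, expose the Letta server on 8283, and point Letta at
# the vLLM server running on the host via host.docker.internal.
docker run \
  -v ~/.letta/.persist/pgdata:/var/lib/postgresql/data \
  -p 8283:8283 \
  -e VLLM_API_BASE="http://host.docker.internal:8000" \
  letta/letta:latest
```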
Linux:
Use --network host and localhost:
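For example, under the same assumptions as above, but sharing the host network so localhost resolves to the host (no -p mapping needed):

```sh
docker run \
  --network host \
  -v ~/.letta/.persist/pgdata:/var/lib/postgresql/data \
  -e VLLM_API_BASE="http://localhost:8000" \
  letta/letta:latest
```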
CLI (pypi only)
Using letta run and letta server with vLLM
To chat with an agent, run:
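For example, assuming vLLM is on its default port:

```sh
export VLLM_API_BASE="http://localhost:8000"
letta run
```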
To run the Letta server, run:
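Again with the same VLLM_API_BASE assumption:

```sh
export VLLM_API_BASE="http://localhost:8000"
letta server
```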
To select the model used by the server, use the dropdown in the ADE or specify an LLMConfig object in the Python SDK.
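A rough sketch of the SDK route, assuming an older letta Python client that exposes create_client() and LLMConfig (import path, field names, endpoint type string, and context window below are assumptions and may differ in newer SDK versions):

```python
from letta import create_client, LLMConfig  # assumed import path; newer SDKs may differ

client = create_client()

# Describe the vLLM-served model explicitly instead of picking it from the ADE dropdown.
# The endpoint type, URL, model repo, and context window are assumptions for this sketch.
vllm_config = LLMConfig(
    model="cognitivecomputations/dolphin-2.2.1-mistral-7b",
    model_endpoint_type="vllm",
    model_endpoint="http://localhost:8000",
    context_window=8192,
)

agent = client.create_agent(name="vllm-agent", llm_config=vllm_config)
```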