vLLM
To use vLLM with Letta, set the environment variable VLLM_API_BASE to point to your vLLM ChatCompletions server.
Setting up vLLM
- Download + install vLLM
- Launch a vLLM OpenAI-compatible API server using the official vLLM documentation
For example, if we want to use the model dolphin-2.2.1-mistral-7b
from HuggingFace, we would run a command along the lines of the sketch below.
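This sketch covers both setup steps. The HuggingFace repository path (cognitivecomputations/dolphin-2.2.1-mistral-7b) and the serve entrypoint are assumptions and may differ across vLLM versions, so check the official vLLM documentation for your release.

```bash
# Step 1: install vLLM (requires a supported GPU environment).
pip install vllm

# Step 2: launch the OpenAI-compatible API server on the default port (8000).
# The repository path below is an assumption; substitute the model you want to serve.
# Older vLLM releases use `python -m vllm.entrypoints.openai.api_server --model ...` instead.
vllm serve cognitivecomputations/dolphin-2.2.1-mistral-7b
```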
vLLM will automatically download the model (if it’s not already downloaded) and store it in your HuggingFace cache directory.
Enabling vLLM as a provider
To enable the vLLM provider, you must set the VLLM_API_BASE
environment variable. When this is set, Letta will use available LLM and embedding models running on vLLM.
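For example, with vLLM serving on its default port on the same machine (an assumption; adjust the host and port to match your deployment):

```bash
# Point Letta at the vLLM OpenAI-compatible endpoint before starting Letta.
export VLLM_API_BASE="http://localhost:8000"
```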
Using the docker compose server with vLLM
Since vLLM is running on the host network, you will need to use host.docker.internal to connect to the vLLM server instead of localhost.
You’ll also want to make sure that port 8000 (the default port for vLLM) is open on your host machine.
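As a sketch, assuming the Letta docker compose file passes VLLM_API_BASE through to the Letta container, starting the server could look like this:

```bash
# Use host.docker.internal so the Letta container can reach the vLLM server
# running on the host; adjust the port if vLLM is not listening on 8000.
VLLM_API_BASE="http://host.docker.internal:8000" docker compose up
```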
CLI (pypi only)
Using letta run and letta server with vLLM
To chat with an agent, run letta run with the VLLM_API_BASE variable set.
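For example (a sketch, assuming vLLM's default endpoint at http://localhost:8000):

```bash
# Point the Letta CLI at the vLLM server, then start a chat with an agent.
export VLLM_API_BASE="http://localhost:8000"
letta run
```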
To run the Letta server, run letta server with the same variable set.
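Again assuming the default endpoint:

```bash
# Start the Letta server against the same vLLM endpoint.
export VLLM_API_BASE="http://localhost:8000"
letta server
```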
To select the model used by the server, use the dropdown in the ADE or specify an LLMConfig object in the Python SDK.