vLLM
To use Letta with vLLM, set the environment variable VLLM_API_BASE to point to your vLLM ChatCompletions server.
Setting up vLLM
- Download + install vLLM (e.g. from PyPI, as sketched below)
- Launch a vLLM OpenAI-compatible API server using the official vLLM documentation
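For the first step, vLLM is typically installed from PyPI; this sketch assumes a recent Python environment and a CUDA-capable GPU for the default build:
```sh
pip install vllm
```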
For example, if we want to use the model dolphin-2.2.1-mistral-7b from HuggingFace, we would run:
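A minimal sketch of the launch command, assuming the model is published as ehartford/dolphin-2.2.1-mistral-7b on the Hugging Face Hub:
```sh
# start an OpenAI-compatible API server on vLLM's default port (8000)
python -m vllm.entrypoints.openai.api_server \
    --model ehartford/dolphin-2.2.1-mistral-7b
```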
vLLM will automatically download the model (if it’s not already downloaded) and store it in your HuggingFace cache directory.
Enabling vLLM as a provider
To enable the vLLM provider, you must set the VLLM_API_BASE environment variable. When this is set, Letta will use available LLM and embedding models running on vLLM.
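For example, if vLLM is serving on the same machine on its default port, this might look like the following (adjust the host, port, and any path suffix to match your deployment):
```sh
export VLLM_API_BASE="http://localhost:8000"
```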
Using the docker run server with vLLM
macOS/Windows:
Since vLLM is running on the host (outside the Letta container), you will need to use host.docker.internal instead of localhost to connect to the vLLM server.
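A minimal sketch, assuming the letta/letta Docker image, Letta's default port 8283, and vLLM listening on the host's port 8000:
```sh
docker run \
  -p 8283:8283 \
  -e VLLM_API_BASE="http://host.docker.internal:8000" \
  letta/letta:latest
```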
Linux:
Use --network host and localhost:
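A minimal sketch under the same assumptions, sharing the host network so that localhost inside the container resolves to the host:
```sh
docker run \
  --network host \
  -e VLLM_API_BASE="http://localhost:8000" \
  letta/letta:latest
```
With --network host there is no need to publish ports; the Letta server listens directly on the host (port 8283 by default).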
CLI (pypi only)
Using letta run and letta server with vLLM
To chat with an agent, run:
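A minimal sketch, assuming vLLM is serving on localhost:8000:
```sh
export VLLM_API_BASE="http://localhost:8000"
letta run
```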
To run the Letta server, run:
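Under the same assumption:
```sh
export VLLM_API_BASE="http://localhost:8000"
letta server
```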
To select the model used by the server, use the dropdown in the ADE or specify an LLMConfig object in the Python SDK.