Ollama
Make sure to use tags when downloading Ollama models! For example, don't do `ollama pull dolphin2.2-mistral`; instead do `ollama pull dolphin2.2-mistral:7b-q6_K` (add the `:7b-q6_K` tag).
If you don’t specify a tag, Ollama may default to using a highly compressed model variant (e.g. Q4). We highly recommend NOT using a compression level below Q5 when using GGUF (stick to Q6 or Q8 if possible). In our testing, certain models start to become extremely unstable (when used with Letta/MemGPT) below Q6.
Setting up Ollama
- Download and install Ollama
- Download a model to test with by running `ollama pull <MODEL_NAME>` in the terminal (check the Ollama model library for available models)
For example, if we want to use Dolphin 2.2.1 Mistral, we can download it by running:
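```sh
# pull the q6_K quantized build of Dolphin 2.2.1 Mistral (note the explicit tag)
ollama pull dolphin2.2-mistral:7b-q6_K
```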
Enabling Ollama as a provider
To enable the Ollama provider, you must set the `OLLAMA_BASE_URL` environment variable. When this is set, Letta will use the available LLM and embedding models running on Ollama.
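For example, if Ollama is running locally on its default port (11434), the variable might be set like this; adjust the URL to match your setup:

```sh
export OLLAMA_BASE_URL=http://localhost:11434
```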
Using the `docker run` server with Ollama
macOS/Windows:
Since Ollama is running on the host network, you will need to use `host.docker.internal` to connect to the Ollama server instead of `localhost`.
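A minimal sketch of the invocation, assuming the `letta/letta` image, the default Letta server port `8283`, and Ollama on its default port `11434` (keep whatever other flags, such as volume mounts, you normally pass to `docker run`):

```sh
docker run \
  -p 8283:8283 \
  -e OLLAMA_BASE_URL="http://host.docker.internal:11434" \
  letta/letta:latest
```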
Linux:
Use `--network host` and `localhost`:
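A sketch under the same assumptions (`letta/letta` image, Ollama on its default port `11434`):

```sh
docker run \
  --network host \
  -e OLLAMA_BASE_URL="http://localhost:11434" \
  letta/letta:latest
```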
CLI (pypi only)
Using `letta run` and `letta server` with Ollama
To chat with an agent, run:
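```sh
# assumes Ollama is serving on its default port on the same machine
export OLLAMA_BASE_URL=http://localhost:11434
letta run
```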
To run the Letta server, run:
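```sh
# with OLLAMA_BASE_URL still set in the environment
letta server
```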
To select the model used by the server, use the dropdown in the ADE or specify an `LLMConfig` object in the Python SDK.
Specifying agent models
When creating agents, you must specify the LLM and embedding models to use via a handle. You can additionally specify a context window limit (which must be less than or equal to the model's maximum context window).
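As a minimal sketch using the `letta_client` Python SDK, assuming handles follow the `provider/model-name` pattern and that a local Letta server is running on the default port; the embedding handle and the `context_window_limit` value shown here are illustrative:

```python
from letta_client import Letta

# connect to a locally running Letta server (default port 8283 assumed)
client = Letta(base_url="http://localhost:8283")

agent = client.agents.create(
    # handles assumed to follow the provider/model-name pattern
    model="ollama/dolphin2.2-mistral:7b-q6_K",
    embedding="ollama/mxbai-embed-large",   # illustrative embedding handle
    context_window_limit=16000,             # optional; must not exceed the model's maximum
)
print(agent.id)
```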