Ollama
Make sure to use tags when downloading Ollama models!
For example, don’t run `ollama pull dolphin2.2-mistral`; instead run `ollama pull dolphin2.2-mistral:7b-q6_K` (note the `:7b-q6_K` tag).
If you don’t specify a tag, Ollama may default to using a highly compressed model variant (e.g. Q4). We highly recommend NOT using a compression level below Q5 when using GGUF (stick to Q6 or Q8 if possible). In our testing, certain models start to become extremely unstable (when used with Letta/MemGPT) below Q6.
Setup Ollama
- Download and install Ollama
- Download a model to test with by running `ollama pull <MODEL_NAME>` in the terminal (check the Ollama model library for available models)
For example, if we want to use Dolphin 2.2.1 Mistral, we can download it by running:
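```sh
ollama pull dolphin2.2-mistral:7b-q6_K
```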
Enabling Ollama with Docker
To enable Ollama models when running the Letta server with Docker, set the `OLLAMA_BASE_URL` environment variable.
macOS/Windows:
Since Ollama is running on the host network, you will need to use `host.docker.internal` to connect to the Ollama server instead of `localhost`.
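A sketch of the run command, assuming the default `letta/letta` image, server port 8283, and Ollama's default port 11434 (see the self-hosting guide for the full set of options):

```sh
# Point the containerized Letta server at the host machine's Ollama instance
docker run \
  -v ~/.letta/.persist/pgdata:/var/lib/postgresql/data \
  -p 8283:8283 \
  -e OLLAMA_BASE_URL="http://host.docker.internal:11434" \
  letta/letta:latest
```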
Linux:
Use `--network host` and connect to Ollama via `localhost`:
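A sketch under the same assumptions as above (default image and Ollama port), using host networking so `localhost` resolves to the host machine:

```sh
# With --network host the container shares the host's network stack,
# so Ollama is reachable at localhost:11434
docker run \
  --network host \
  -v ~/.letta/.persist/pgdata:/var/lib/postgresql/data \
  -e OLLAMA_BASE_URL="http://localhost:11434" \
  letta/letta:latest
```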
See the self-hosting guide for more information on running Letta with Docker.
Specifying agent models
When creating agents on your self-hosted server, you must specify both the LLM and embedding models to use via handles. You can additionally specify a context window limit (which must be less than or equal to the model's maximum context window size).
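For example, a minimal sketch using the `letta-client` Python SDK, assuming the Dolphin model pulled above and an Ollama embedding model such as `mxbai-embed-large` (the exact handles and the context window limit depend on which models you have downloaded):

```python
from letta_client import Letta

# Connect to the self-hosted Letta server (default port 8283)
client = Letta(base_url="http://localhost:8283")

agent = client.agents.create(
    model="ollama/dolphin2.2-mistral:7b-q6_K",   # LLM handle (provider/model)
    embedding="ollama/mxbai-embed-large",        # embedding handle (assumes this model is pulled)
    context_window_limit=8192,                   # optional; must not exceed the model's max
    memory_blocks=[
        {"label": "human", "value": "Name: Sam"},
        {"label": "persona", "value": "You are a helpful assistant."},
    ],
)
print(agent.id)
```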
For Letta Cloud usage, see the quickstart guide. Cloud deployments manage embeddings automatically and don’t require provider configuration.