Run Hermes Locally with Ollama — Complete Privacy
Run Hermes entirely offline with Ollama — no API keys, no cloud, complete privacy with local LLMs.
Ollama runs open-source LLMs entirely on your machine. No API keys, no cloud dependency, no data leaves your computer. Perfect for privacy-sensitive work, offline environments, or when you want to avoid API costs entirely.
Managed cloud · API costs included · Skill library · Cancel anytime
Before you start:
- ☑Hermes Agent installed
- ☑Ollama installed
- ☑Sufficient RAM/VRAM for your chosen model (8GB minimum, 24GB+ recommended)
Steps
- 1
Install Ollama
Download and install from ollama.com
- 2
Pull a model
ollama pull qwen3.5:35b or ollama pull gemma4:27b
- 3
Start Ollama server
ollama serve (runs on localhost:11434)
- 4
Configure Hermes
Set model: provider: ollama and model: base_url: http://localhost:11434/v1
- 5
Choose your model
Set model: default: qwen3.5:35b (or your pulled model)
Pro Tips
- 💡Carnice 35B A3B is specifically tuned for Hermes tool calling — most reliable local model
- 💡Qwen 3.5 35B and Gemma 4 27B are popular community choices
- 💡Use Q4_K_M quantization for balance of quality and memory usage
- 💡Set context_length explicitly if auto-detection is wrong
Troubleshooting
❌ Model doesn't use tools
✅ Many local models don't support function calling. Use Carnice, Qwen 3.5, or another Tier 1 model from the community recommendations.
❌ Out of memory
✅ Try a smaller model or lower quantization. 7B models need ~8GB, 35B models need ~24GB VRAM.
❌ Slow responses
✅ Local inference is CPU/GPU bound. Use a smaller model, enable GPU acceleration in Ollama, or accept the speed tradeoff for privacy.
❌ Connection refused
✅ Ensure ollama serve is running. Check that the base_url matches Ollama's actual address (default localhost:11434).