Running Hermes with Ollama means fully local AI — no API costs, no data leaving your machine. Here is how to set it up and which models work best.
Why Ollama
- Free — no per-token costs
- Privacy — all data stays local
- Offline capable — no internet needed after model download
Trade-off: you need good hardware, and models are generally less capable than cloud LLMs.
Install Ollama
On Mac: brew install ollama
On Linux: check ollama.ai for install instructions
On Windows: Use WSL2
Start the service: ollama serve (runs on localhost:11434)
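A quick way to confirm the service is up is to query its /api/tags endpoint, which lists every downloaded model. A minimal Python sketch (the endpoint is part of Ollama's HTTP API; `parse_model_names` is a helper added here for illustration):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address

def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from a /api/tags response payload."""
    return [m["name"] for m in payload.get("models", [])]

def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask a running Ollama server which models are downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(json.load(resp))

if __name__ == "__main__":
    try:
        print("Ollama is up; models:", list_local_models())
    except OSError:
        print("Ollama is not reachable; is `ollama serve` running?")
```

If the call fails with a connection error, the service isn't running or is bound to a different port.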
Install Models
ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
ollama pull llama3:8b
ollama pull mistral:7b
For better agentic performance, pull a larger model:
ollama pull qwen2.5:27b
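The same pulls can be scripted against Ollama's HTTP API rather than the CLI. A sketch assuming the documented POST /api/pull endpoint, which streams progress back as newline-delimited JSON:

```python
import json
import urllib.request

def pull_body(model: str) -> bytes:
    """JSON body for Ollama's POST /api/pull endpoint."""
    return json.dumps({"model": model}).encode()

def pull_model(model: str, base_url: str = "http://localhost:11434") -> None:
    """Download a model through the API, printing streamed status lines."""
    req = urllib.request.Request(
        f"{base_url}/api/pull",
        data=pull_body(model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # one JSON status object per line
            print(json.loads(line).get("status", ""))
```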
Configure Hermes
In Hermes:
hermes model
Select "Custom Endpoint" or "Ollama". Set URL to http://localhost:11434.
The first call downloads the model if not already in Ollama.
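To sanity-check the URL independently of Hermes, you can send one message straight to Ollama's native /api/chat endpoint. A minimal sketch (non-streaming; the model name is an example):

```python
import json
import urllib.request

def chat_body(model: str, prompt: str) -> bytes:
    """Non-streaming JSON body for Ollama's POST /api/chat endpoint."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()

def chat(model: str, prompt: str, base_url: str = "http://localhost:11434") -> str:
    """Send one user message and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=chat_body(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

If this call answers but Hermes doesn't, the problem is in the Hermes config rather than Ollama.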
Which Models Work Best for Agentic Tasks
Benchmark data from the Hermes Discord:
| Model | TAU2 Score | Notes |
|---|---|---|
| Qwen 3.5 27B | 79% | Community favorite for local |
| Gemma 4 31B | 76.9% | Strong reasoning |
| GLM | Above 99% | Requires a GPU |
For 27B models: 16GB+ VRAM recommended. 7B models run on 8GB+ but with reduced capability.
Performance Trade-offs
| Aspect | Cloud LLM | Ollama Local |
|---|---|---|
| Capability | GPT-4 class | Roughly GPT-3.5 class |
| Speed | Fast API response | Depends on hardware |
| Cost | Per-token | Hardware only |
| Privacy | Data to provider | Fully local |
| Availability | Requires internet | Works offline |
For simple agentic tasks (file management, basic research), local models work fine. For complex reasoning, cloud models are better.
VPS Requirements
Ollama on a VPS needs more resources:
- 7B model: 4GB RAM minimum
- 13B model: 8GB RAM
- 27B model: 16GB+ RAM or GPU instance
- 70B model: Dedicated GPU (A100, H100) — $40-80/mo
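The tiers above can be collapsed into a small sizing helper. A rough heuristic only, assuming Ollama's default quantized builds; the cutoffs mirror the list, and anything 70B-class is treated as needing a GPU instance:

```python
def min_vps_ram_gb(params_billion: float) -> int:
    """Minimum VPS RAM (GB) for a model size, per the tiers above.

    Rough heuristic; assumes Ollama's default quantized model builds.
    """
    if params_billion <= 7:
        return 4
    if params_billion <= 13:
        return 8
    if params_billion <= 27:
        return 16
    # 70B-class models need a dedicated GPU instance instead
    raise ValueError("model too large for a CPU-only VPS tier")

# Example: a 14B model lands in the 16GB tier
# min_vps_ram_gb(14) -> 16
```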
For cost-effective local: run Ollama on your existing machine (Mac with Apple Silicon handles 7B-14B well).
Common Issues
"Model not found"
Run ollama list to see downloaded models. Pull the model you want.
Slow responses
Larger models on limited hardware respond slowly. Use a 7B model, or switch to something smaller.
No internet connection
Models must be downloaded while online. After that, Ollama runs fully offline.
Web browsing not working
Local models via Ollama may have web tool limitations. Cloud models handle tool calling better.
FAQ
Can I use multiple Ollama models?
Yes — pull each model with ollama pull, then switch by changing the model in the Hermes config.
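Switching works because the model is just a field on each request: one server answers for any pulled model. A sketch using the native /api/generate endpoint (model names are examples):

```python
import json

def generate_body(model: str, prompt: str) -> bytes:
    """Body for Ollama's POST /api/generate; the model is chosen per request."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

# The same server serves both requests; only the model name changes:
quick = generate_body("qwen2.5:7b", "Summarize TODO.md")
careful = generate_body("qwen2.5:14b", "Summarize TODO.md")
```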
Does this work with the gateway?
Yes — run Hermes on a VPS with Ollama and access it via Telegram/Discord. All data stays on the VPS.