Use Hermes Agent with Ollama for Local AI
Run Hermes Agent with local LLMs via Ollama — fully private, no API keys, no cloud dependency.
Running Hermes with Ollama keeps everything on your machine — no API keys, no cloud costs, no data leaving your network. It's the ideal setup for privacy-conscious users or anyone who wants full control over their AI stack.
Before you start:
- ☑ Hermes Agent installed
- ☑ Ollama installed (ollama.com) on the same or a networked machine
- ☑ Sufficient RAM: 8GB minimum for 7B models, 16GB+ recommended for 13B+ models
- ☑ Optional: NVIDIA or AMD GPU for significantly faster inference
Steps
1. Install Ollama
   Run 'curl -fsSL https://ollama.com/install.sh | sh' (works on Linux and macOS).
2. Pull a model
   Run 'ollama pull hermes3' for the official Hermes 3 model, or pull any compatible model.
3. Configure Hermes
   Set 'model: provider: ollama' and 'model: name: hermes3' in config.yaml.
4. Set the endpoint
   Ollama defaults to http://localhost:11434. Change this if Ollama runs on another machine.
5. Start Hermes
   Run 'hermes start'. All inference runs locally, with zero data leaving your machine.
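Putting steps 3 and 4 together, a minimal config.yaml might look like the sketch below. The exact key nesting is inferred from the step text and may differ in your Hermes version, so check your installation's reference config; 'baseUrl' is only needed when Ollama runs on another machine.

```yaml
# Minimal local-Ollama config sketch -- key layout inferred from the
# steps above; verify against your Hermes version's reference config.
model:
  provider: ollama                  # route all inference through Ollama
  name: hermes3                     # the model pulled in step 2
  baseUrl: http://localhost:11434   # Ollama's default endpoint (step 4)
```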
Pro Tips
- 💡 The official Hermes 3 model ('ollama pull hermes3') is optimized for tool use and works best with Hermes Agent — start here before trying other models
- 💡 For VPS deployments without a GPU, try smaller quantized models (Q4_K_M) — they run on CPU but are slower
- 💡 Ollama can run on a separate powerful machine while Hermes runs on a smaller server — set 'model: baseUrl: http://your-gpu-machine:11434' in config
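When Hermes and Ollama live on different machines, it's worth confirming the endpoint is reachable before editing config.yaml. A quick sketch, assuming the default port — Ollama's GET /api/tags endpoint returns the installed models, so it doubles as a health check:

```shell
# Probe an Ollama endpoint; point OLLAMA_URL at your GPU machine if remote.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"
if curl -fsS --max-time 3 "$OLLAMA_URL/api/tags" > /dev/null 2>&1; then
  echo "Ollama reachable at $OLLAMA_URL"
else
  echo "Ollama NOT reachable at $OLLAMA_URL (is 'ollama serve' running? firewall open?)"
fi
```

If the probe fails from the Hermes machine but works locally on the GPU machine, the usual culprit is Ollama binding only to localhost or port 11434 being blocked by a firewall.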
Troubleshooting
❌ Ollama connection refused error
✅ Check that Ollama is running: 'ollama serve'. By default it listens on localhost:11434. If running on a different machine, ensure port 11434 is open in your firewall.
❌ Hermes responses are extremely slow with Ollama
✅ You're likely running CPU-only inference on a model too large for your RAM. Try a smaller or more heavily quantized variant ('ollama list' and the model's page on ollama.com show the available tags), or add a GPU to your setup.
❌ Model not found error despite pulling it
✅ Check the model name spelling: 'ollama list' shows installed models. Use the exact name shown, including any variant suffix.
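To head off name mismatches entirely, check what's actually installed before writing config.yaml. A small sketch that degrades gracefully when the ollama CLI isn't on PATH:

```shell
# Show exact installed model names; the NAME column (e.g. "hermes3:latest")
# is what config.yaml must match verbatim, including any variant suffix.
if command -v ollama > /dev/null 2>&1; then
  MODELS="$(ollama list)"
else
  MODELS="ollama CLI not found on PATH -- install it first (step 1)"
fi
echo "$MODELS"
```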