Nous Research · Hermes Agent

Local AI with Ollama

Tags: hermes agent, ollama, local, privacy

Run Hermes Agent fully offline with Ollama — no API keys, no cloud, your data never leaves your machine. Full setup guide.


Running Hermes with Ollama means fully local AI — no API costs, no data leaving your machine. Here is how to set it up and which models work best.

Why Ollama

  • Free — no per-token costs
  • Private — all data stays local
  • Offline capable — no internet needed after the model download

Trade-off: you need good hardware, and models are generally less capable than cloud LLMs.

Install Ollama

  • Mac: brew install ollama
  • Linux: see ollama.ai for install instructions
  • Windows: use WSL2

Start the service: ollama serve (runs on localhost:11434)
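Before pointing Hermes at the server, it is worth confirming Ollama is actually listening. A minimal sketch, using Ollama's /api/tags endpoint (which lists locally pulled models) via the standard library:

```python
import json
from urllib import request, error

def list_local_models(host="http://localhost:11434"):
    """Return names of models already pulled into Ollama, or [] if unreachable."""
    try:
        with request.urlopen(f"{host}/api/tags", timeout=5) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except (error.URLError, OSError):
        return []  # ollama serve is not running, or not reachable on this host

print(list_local_models())
```

An empty list with the server running just means no models have been pulled yet.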

Install Models

ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
ollama pull llama3:8b
ollama pull mistral:7b

For better agentic performance, a larger model is recommended:

ollama pull qwen2.5:27b

Configure Hermes

In Hermes:

hermes model

Select "Custom Endpoint" or "Ollama". Set URL to http://localhost:11434.

The first call downloads the model if not already in Ollama.
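Under the hood, Hermes talks to the same HTTP API you can hit directly. A sketch of a non-streaming request to Ollama's /api/chat endpoint (the model name is an example and must already be pulled):

```python
import json
from urllib import request

def build_chat_request(model, prompt, host="http://localhost:11434"):
    """Build an HTTP request for Ollama's /api/chat endpoint (non-streaming)."""
    payload = {
        "model": model,  # e.g. "qwen2.5:7b" -- must already be pulled
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one complete JSON response
    }
    return request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("qwen2.5:7b", "List three uses for a local agent.")
# To actually send it (requires a running Ollama server):
# with request.urlopen(req) as resp:
#     print(json.load(resp)["message"]["content"])
```

This is useful for debugging: if a raw request like this works, any remaining problem is in the Hermes endpoint configuration rather than in Ollama.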

Which Models Work Best for Agentic Tasks

Benchmark data from the Hermes Discord:

| Model | TAU2 Score | Notes |
|---|---|---|
| Qwen 3.5 27B | 79% | Community favorite for local use |
| Gemma 4 31B | 76.9% | Strong reasoning |
| GLM | above 99% | Requires a GPU |

For 27B models: 16GB+ VRAM recommended. 7B models run on 8GB+ but with reduced capability.

Performance Trade-offs

| Aspect | Cloud LLM | Ollama Local |
|---|---|---|
| Capability | GPT-4 class | Roughly GPT-3.5 class |
| Speed | Fast API response | Depends on hardware |
| Cost | Per-token | Hardware only |
| Privacy | Data sent to provider | Fully local |
| Availability | Requires internet | Works offline |

For simple agentic tasks (file management, basic research), local models work fine. For complex reasoning, cloud models are better.

VPS Requirements

Ollama on a VPS needs more resources:

  • 7B model: 4GB RAM minimum
  • 13B model: 8GB RAM
  • 27B model: 16GB+ RAM or GPU instance
  • 70B model: Dedicated GPU (A100, H100) — $40-80/mo
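The figures above follow a rough rule of thumb: a 4-bit (Q4) quantized model needs about half a byte per parameter, plus some overhead for the KV cache and runtime. A sketch of that estimate (the 0.5 bytes/param and 1 GB overhead are ballpark assumptions, not exact requirements):

```python
def estimated_ram_gb(params_billion, bytes_per_param=0.5, overhead_gb=1.0):
    """Rough RAM estimate for a quantized model.

    bytes_per_param=0.5 corresponds to 4-bit (Q4) quantization, a common
    Ollama default; overhead_gb covers the KV cache and runtime. Treat the
    result as a ballpark, not a hard requirement.
    """
    return params_billion * bytes_per_param + overhead_gb

for size in (7, 13, 27, 70):
    print(f"{size}B -> ~{estimated_ram_gb(size):.0f} GB RAM")
```

Larger context windows grow the KV cache, so leave headroom beyond these numbers if you plan on long agentic sessions.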

For cost-effective local: run Ollama on your existing machine (Mac with Apple Silicon handles 7B-14B well).

Common Issues

"Model not found"

Run ollama list to see downloaded models. Pull the model you want.

Slow responses

Larger models on limited hardware are slow. Use a 7B model, or downgrade to something smaller.

No internet connection

Ollama models must be downloaded while you are online. After that, everything runs fully offline.

Web browsing not working

Local models via Ollama may have web tool limitations. Cloud models handle tool calling better.

See also: VPS hosting guide · Privacy guide · Compare to cloud LLMs


FAQ

Can I use multiple Ollama models? Yes: pull each one with ollama pull, then switch by changing the model name in the Hermes config.

Does this work with the gateway? Yes — run Hermes on a VPS with Ollama, access via Telegram/Discord. All data stays on the VPS.


Frequently Asked Questions

What are the minimum hardware requirements to run Hermes with Ollama locally?

7B models need 8GB+ RAM total (Apple Silicon Macs handle these well), 13B models need 16GB+, 27B models need 32GB+ or a GPU, and 70B models require a dedicated GPU instance at $40–80/month.

Which Ollama models perform best for agentic tasks?

Qwen 3.5 27B scores 79% on the TAU2 agentic benchmark and is the community favorite for local deployment. Gemma 4 31B scores 76.9%. Smaller 7B models handle simple file management adequately but struggle with complex multi-step tasks and browser automation.

Can Hermes running on Ollama access the web?

Local Ollama models may not have robust web browsing tool definitions built in — the capability depends on the model's native tool-use training. Cloud models (Kimi, DeepSeek, Claude) handle web tasks more reliably.

How does Hermes running fully offline with Ollama compare to cloud API access?

Zero per-token cost after hardware purchase, complete data privacy, and full offline capability are the benefits. The trade-off is lower reasoning capability and slower response times on larger models without GPU acceleration.

What Ollama models should I download first?

Start with `ollama pull qwen2.5:7b` for a capable lightweight model, or `ollama pull qwen2.5:14b` if you have the RAM. For better agentic performance, `ollama pull qwen2.5:27b` is the recommended local option.

Ready to Run Your Own AI Agent?

Self-host Hermes in 60 seconds. No credit card, no cloud lock-in.

Deploy Hermes Free →
