Nous Research Hermes Agent

Local LLM Support — AI That Runs on Your Hardware

Key Points

  • Ollama integration out of the box
  • vLLM for high-throughput inference
  • Any OpenAI-compatible server
  • Full privacy — nothing leaves your machine
  • GPU acceleration support
  • Quantized models for lower VRAM

How It Works

  1. Install Ollama and pull a model
  2. Point Hermes to localhost in config
  3. All inference runs locally
  4. Combine with cloud models for hybrid setup
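The steps above can be sketched as a request aimed at Ollama's OpenAI-compatible endpoint on localhost. The port (11434) is Ollama's default; the helper name and payload assembly here are illustrative, not Hermes's actual config schema.

```python
# Minimal sketch: build an OpenAI-style chat request for a local Ollama server.
# The endpoint is Ollama's default; helper and payload shape are illustrative.
import json

OLLAMA_BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible API root

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble a chat-completions payload that any OpenAI-compatible server accepts."""
    return {
        "model": model,  # e.g. a model pulled with `ollama pull llama3.1`
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

payload = build_chat_request("llama3.1:8b", "Explain mutexes briefly.")
print(json.dumps(payload, indent=2))
```

The same payload works against vLLM or any other OpenAI-compatible server; only the base URL changes.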

Real-World Use Cases

Air-Gapped Deployment

Deploy Hermes on a machine with no internet access. Local inference via Ollama handles all reasoning; local embeddings handle memory search; everything runs on-premises. Ideal for regulated environments, classified work, or simply not trusting cloud providers.
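An air-gapped setup reduces to one invariant: every configured endpoint resolves to the local host. A hypothetical all-local config and a sanity check might look like this; the key names are illustrative, not Hermes's real schema.

```python
# Hypothetical all-local configuration for an air-gapped deployment.
# Key names are illustrative; the invariant is that no endpoint leaves the host.
AIR_GAPPED_CONFIG = {
    "chat_provider": {
        "type": "ollama",
        "base_url": "http://localhost:11434",
        "model": "llama3.1:8b",
    },
    "embedding_provider": {
        "type": "ollama",
        "base_url": "http://localhost:11434",
        "model": "nomic-embed-text",
    },
}

def assert_no_egress(config: dict) -> None:
    """Sanity check: every configured base_url must point at the local machine."""
    for provider in config.values():
        host = provider["base_url"].split("//")[1].split(":")[0]
        assert host in ("localhost", "127.0.0.1"), f"non-local endpoint: {host}"

assert_no_egress(AIR_GAPPED_CONFIG)
print("all endpoints local")
```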

Developer Laptop Setup

Run a quantized model (Llama 3.1 8B Q4, Mistral 7B) on a MacBook with Apple Silicon. Hermes uses the local model for interactive coding assistance and falls back to cloud APIs for heavy-lifting tasks. No network latency for quick questions; full power when needed.
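The local-first-with-cloud-fallback pattern can be sketched as a small routing function. The character budget and backend names here are assumptions for illustration, not Hermes's actual heuristics.

```python
# Sketch of local-first routing with cloud fallback for the laptop setup above.
# The threshold and backend labels are illustrative assumptions.
def pick_backend(prompt: str, needs_long_context: bool = False) -> str:
    """Route quick interactive queries locally; send heavy tasks to the cloud."""
    LOCAL_CONTEXT_BUDGET = 4000  # rough budget a quantized 7B-8B model handles comfortably
    if needs_long_context or len(prompt) > LOCAL_CONTEXT_BUDGET:
        return "cloud"
    return "local"

print(pick_backend("What does this regex do?"))  # quick question stays local
print(pick_backend("x" * 10_000))                # oversized context goes to the cloud
```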

High-Throughput GPU Server

Deploy vLLM on a machine with multiple GPUs for high-throughput inference. Use Hermes's parallel subagent architecture to saturate the GPU with concurrent requests — effectively a private AI API cluster with no per-token cost.
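Saturating a batching server like vLLM comes down to keeping many requests in flight while capping concurrency at what the server batches well. A minimal sketch, with the actual HTTP call replaced by a placeholder:

```python
# Sketch of fanning out concurrent subagent requests to a vLLM server.
# The worker is a stand-in for a real HTTP POST; the concurrency cap is the point.
import asyncio

async def infer(prompt: str, sem: asyncio.Semaphore) -> str:
    async with sem:              # cap in-flight requests at the server's sweet spot
        await asyncio.sleep(0)   # placeholder for the actual call to the vLLM endpoint
        return f"response:{prompt}"

async def run_batch(prompts: list[str], max_concurrency: int = 64) -> list[str]:
    sem = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(infer(p, sem) for p in prompts))

results = asyncio.run(run_batch([f"task-{i}" for i in range(8)]))
print(len(results))
```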

Hybrid Privacy-Performance Routing

Route sensitive tasks (personal data, company IP) to local models; route commodity tasks (public research, code formatting) to cloud providers. One config file defines the routing rules; Hermes enforces them automatically.
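A routing table in the spirit of the one-config-file approach might look like the dictionary below. The tag names and the local-by-default fallback are assumptions for illustration, not Hermes's actual rule format.

```python
# Hypothetical routing table: sensitive task tags map to the local model,
# commodity tags to the cloud. Tag names and structure are illustrative.
ROUTING_RULES = {
    "personal_data": "local",
    "company_ip": "local",
    "public_research": "cloud",
    "code_formatting": "cloud",
}

def route(task_tag: str) -> str:
    """Default to local so an unclassified task never leaks to the cloud."""
    return ROUTING_RULES.get(task_tag, "local")

print(route("company_ip"))
print(route("unclassified_task"))
```

Defaulting unknown tags to the local model is the conservative choice for a privacy-first policy.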

Under the Hood

Hermes implements local LLM support through its unified provider abstraction layer, treating local inference servers identically to cloud providers at the interface level. Ollama is the primary integration — point Hermes at localhost:11434 and it automatically discovers available Ollama models, handles the Ollama-specific API format, and manages context window sizes per model. vLLM's OpenAI-compatible API requires only the base URL configuration. Any server implementing the OpenAI chat completions API (LM Studio, llama.cpp server, Kobold, text-generation-webui) works without additional configuration.
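The "identical at the interface level" idea can be sketched as a single provider type parameterized only by base URL: every OpenAI-compatible server, local or cloud, serves the same chat-completions path. Class and host names below are assumptions, not Hermes's actual API.

```python
# Sketch of a unified provider abstraction: local and cloud backends share one
# interface and differ only in base URL. Names here are illustrative.
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    base_url: str

    def chat_endpoint(self) -> str:
        # Every OpenAI-compatible server serves this same path.
        return f"{self.base_url}/v1/chat/completions"

ollama = Provider("ollama", "http://localhost:11434")
vllm = Provider("vllm", "http://gpu-box:8000")      # hypothetical GPU server host
cloud = Provider("openai", "https://api.openai.com")

# Identical call shape regardless of backend:
for p in (ollama, vllm, cloud):
    print(p.chat_endpoint())
```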

Quantized models are supported transparently — Hermes doesn't need to know whether it's talking to a full-precision cloud model or a Q4_K_M quantized local model. Performance expectations calibrate automatically: the agent's timeout thresholds and retry behavior adapt to measured response latency. On Apple Silicon, Metal GPU acceleration via Ollama delivers 50-80 tokens/second on 7B models, fast enough for interactive use.
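One common way to adapt timeouts to measured latency is an exponential moving average with a safety factor; a minimal sketch of that idea follows. The constants and class name are illustrative assumptions, not Hermes's actual tuning.

```python
# Sketch of latency-adaptive timeouts: track an exponential moving average of
# response latency and size the timeout as a multiple of it. Constants are
# illustrative assumptions.
class AdaptiveTimeout:
    def __init__(self, initial_s: float = 30.0, alpha: float = 0.2, factor: float = 3.0):
        self.ema = initial_s / factor  # seed the average from the initial timeout
        self.alpha = alpha             # weight given to the newest measurement
        self.factor = factor           # safety multiple over typical latency

    def record(self, latency_s: float) -> None:
        self.ema = self.alpha * latency_s + (1 - self.alpha) * self.ema

    @property
    def timeout_s(self) -> float:
        return self.factor * self.ema

t = AdaptiveTimeout()
for lat in (2.0, 2.5, 1.8):   # observed latencies from a fast local 7B model
    t.record(lat)
print(round(t.timeout_s, 1))  # timeout shrinks toward ~3x typical latency
```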

The local LLM setup integrates with all other Hermes features: persistent memory, skill loading, tool use, multi-agent orchestration, and the RL pipeline all work identically whether the inference backend is local or cloud. The only capabilities that don't transfer are provider-specific features such as OpenAI's code interpreter or Anthropic's extended thinking; those live on the provider side, not in Hermes. The RL pipeline specifically supports local model fine-tuning: collect trajectories with cloud models, run RL training with Atropos on your GPU hardware, then deploy the fine-tuned model locally.

Related Features