Best Local Models for Hermes Agent (2026 Community Poll)

Tags: hermes-agent, local-models, ollama, hardware, community

Community-tested local models that work best with Hermes Agent — from Gemma 4 to Qwen 3.5 to Carnice, ranked by tool-calling reliability.

Running Hermes with local models via Ollama means zero cloud costs and complete privacy — but not all models work equally well for agentic tasks. We polled the Hermes community to find out what actually works.

The Key Requirement: Tool Calling

Hermes is an agent, not a chatbot. The model needs to:

  • Recognize when to use tools
  • Format tool calls correctly
  • Handle multi-step tool chains
  • Not refuse agentic actions

Many models can chat well but fail at tool calling. This list focuses on models the community has verified work reliably with Hermes's agentic features.
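To make "format tool calls correctly" concrete, here is a sketch of the structured shape a tool-capable model is expected to emit. This follows the OpenAI-style tool-call convention that Ollama uses; the tool name `read_file` and its argument are hypothetical examples, and Hermes's runtime may use slightly different field names.

```python
import json

# Hypothetical example of a well-formed assistant turn containing a tool call.
tool_call_turn = {
    "role": "assistant",
    "tool_calls": [
        {
            "function": {
                "name": "read_file",            # hypothetical tool
                "arguments": {"path": "notes.txt"},
            }
        }
    ],
}

# A model that "fails at tool calling" typically describes the action in
# loose prose instead of emitting structured JSON the agent can parse.
print(json.dumps(tool_call_turn, indent=2))
```

Models that chat well but fail agentically usually break down at exactly this step: they narrate the tool use instead of producing the parseable structure.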

Top Community Recommendations

Tier 1: Best Performance

Carnice MOE 35B A3B

  • Specifically tuned for Hermes tool use
  • ~158 tokens/sec on RTX 5090
  • Best choice if you have 24GB+ VRAM
  • From the Discord: "Carnice series was specifically tuned for Hermes so if you have any issues with Qwen3.5 or it completes something 60% of the time I promise Carnice will do it."

Qwen 3.5 35B A3B

  • Strong reasoning and tool calling
  • ~60 tokens/sec on RTX 5090
  • Good balance of speed and capability
  • Q6_K_L quantization recommended

Nemotron 3 Super 120B

  • Excellent for complex reasoning tasks
  • Requires serious hardware (DGX Spark mentioned)
  • Best thinking/reasoning model in the poll
  • "nemo3super120b — I like how nemo thinks"

Tier 2: Solid Choices

Gemma 4 27B

  • Works on M-series Macs (M5 Max tested)
  • "Gemma4 is more rounded" but has known tool issues
  • Good for general tasks, less reliable for complex tool chains

Qwen 9B / 35B

  • Multiple users running successfully
  • Good entry point for smaller setups

MiniMax M2.5 UD Q8

  • Paired with Qwen 3.5 by several users
  • Good for variety in model switching

Tier 3: Use With Caution

Gemma 4 (smaller variants)

  • "I have been having a ton of tool issues with Gemma4"
  • Works for chat, unreliable for agentic tasks

Very Small Models (7B and under)

  • "4GB VRAM is very low"
  • Most 7B models struggle with tool calling
  • Not recommended for agentic use

Hardware Requirements

| Model           | Min VRAM | Recommended | Speed                 |
|-----------------|----------|-------------|-----------------------|
| Carnice 35B A3B | 24GB     | 32GB+       | ~160 tok/s (5090)     |
| Qwen 3.5 35B    | 24GB     | 32GB+       | ~60 tok/s (5090)      |
| Gemma 4 27B     | 16GB     | 24GB        | Varies                |
| Qwen 9B         | 8GB      | 12GB        | Good on consumer GPUs |
| Nemotron 120B   | 48GB+    | DGX-class   | Slow but capable      |

Model Configuration Tips

1. Set context length explicitly

```yaml
model:
  provider: ollama
  name: carnice-moe-35b-a3b
  contextLength: 32768
```
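If you bypass the Hermes config and talk to Ollama's HTTP API directly, the equivalent setting is the `num_ctx` option on the request. The sketch below builds such a request body (assumes Ollama's `/api/chat` endpoint; the model name is the one from the config above):

```python
def chat_payload(model: str, prompt: str, context_length: int) -> dict:
    """Build an Ollama /api/chat request body with an explicit context window."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Ollama falls back to a small default context window unless num_ctx
        # is set; agents need room for tool definitions plus the conversation.
        "options": {"num_ctx": context_length},
    }

payload = chat_payload("carnice-moe-35b-a3b", "List your tools.", 32768)
print(payload["options"])
```

Setting this explicitly matters for agents: tool schemas alone can consume thousands of tokens before the conversation even starts.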

2. Use quantization wisely

  • Q6_K_L: Best quality, needs more VRAM
  • Q4_K_M: Good balance for most users
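A back-of-the-envelope way to pick a quantization is to estimate the weight footprint: roughly parameters × bits-per-weight ÷ 8, plus headroom for KV cache and activations. The bits-per-weight figures below are approximate averages for llama.cpp quant formats, so treat the results as ballpark only:

```python
# Approximate average bits per weight for common llama.cpp quant formats.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q6_K_L": 6.6, "Q8_0": 8.5, "F16": 16.0}

def weight_gib(params_billions: float, quant: str) -> float:
    """GiB needed for the quantized weights alone (no KV cache or overhead)."""
    total_bytes = params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return total_bytes / 2**30

# A 35B model: Q4_K_M leaves KV-cache headroom on a 24GB card,
# while Q6_K_L already exceeds 24GB before any cache is allocated.
print(f"35B @ Q4_K_M ≈ {weight_gib(35, 'Q4_K_M'):.1f} GiB")
print(f"35B @ Q6_K_L ≈ {weight_gib(35, 'Q6_K_L'):.1f} GiB")
```

This lines up with the community guidance: Q6_K_L on a 35B model wants 32GB, while Q4_K_M is the practical ceiling for 24GB cards.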

3. Test tool calling first

Ask Hermes: "List all the tools you have access to and demonstrate using one."

If it responds like a basic chatbot, the model doesn't support tool calling properly.
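The same check can be automated against Ollama directly. The sketch below builds a `/api/chat` request carrying an OpenAI-style function schema (the `tools` field Ollama accepts for tool-capable models) and checks whether the reply contains a structured tool call; the `get_time` tool is a hypothetical probe, and the live call is left commented out since it needs a running Ollama server:

```python
import json
import urllib.request

def probe_payload(model: str) -> dict:
    """Request designed to trigger a structured tool call, not prose."""
    return {
        "model": model,
        "stream": False,
        "messages": [{"role": "user", "content": "What time is it? Use a tool."}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_time",  # hypothetical probe tool
                "description": "Return the current time.",
                "parameters": {"type": "object", "properties": {}},
            },
        }],
    }

def supports_tools(response: dict) -> bool:
    """True if the model answered with a structured tool call."""
    return bool(response.get("message", {}).get("tool_calls"))

# Live check (requires a running Ollama server on the default port):
# req = urllib.request.Request(
#     "http://localhost:11434/api/chat",
#     data=json.dumps(probe_payload("carnice-moe-35b-a3b")).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(supports_tools(json.load(urllib.request.urlopen(req))))
```

A model that answers with `message.content` prose instead of `message.tool_calls` is the automated equivalent of "responds like a basic chatbot."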

Community Setups

Prosumer Build (RTX 4090/5090)

  • Carnice 35B A3B as primary
  • Qwen 3.5 as backup

Mac M-Series

  • Gemma 4 27B on M5 Max works
  • Test tool calling — some users report issues


Data from Nous Research Discord community poll, April 2026.

Frequently Asked Questions

What's the best local model for Hermes Agent?

Carnice MOE 35B A3B is specifically tuned for Hermes and has the most reliable tool calling. Qwen 3.5 35B is the runner-up with strong reasoning. Both require 24GB+ VRAM.

Why doesn't my local model use tools with Hermes?

Not all models support function/tool calling. Very small models (7B and under) often can't fit Hermes's tool definitions in their context window. Try switching to Carnice, Qwen 3.5, or another Tier 1/2 model.

Can I run Hermes locally on a Mac?

Yes. M-series Macs with unified memory can run models like Gemma 4 27B. Users have confirmed Gemma 4 works on M5 Max, though some report tool calling issues.

What's the minimum VRAM for local Hermes?

8GB VRAM can run Qwen 9B for basic tasks. For reliable agentic work with tool calling, 16GB+ is recommended. The best models need 24GB+.
