Running Hermes with local models via Ollama means zero cloud costs and complete privacy — but not all models work equally well for agentic tasks. We polled the Hermes community to find out what actually works.
The Key Requirement: Tool Calling
Hermes is an agent, not a chatbot. The model needs to:
- Recognize when to use tools
- Format tool calls correctly
- Handle multi-step tool chains
- Not refuse agentic actions
Many models can chat well but fail at tool calling. This list focuses on models the community has verified work reliably with Hermes's agentic features.
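The agentic loop those requirements describe can be sketched in a few lines. This is a mock, not Hermes's actual code: the message shape (OpenAI/Ollama-style `tool_calls` with a `function` name and arguments) and the `run_agent` / `mock_model` names are assumptions for illustration.

```python
import json

def run_agent(model, tools, prompt, max_steps=5):
    """Minimal mock of an agentic loop: call the model, execute any tool
    calls it returns, feed results back, stop when it answers in plain text."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        reply = model(messages)
        calls = reply.get("tool_calls")
        if not calls:
            return reply["content"]  # plain answer: the chain is done
        messages.append(reply)
        for call in calls:
            fn = call["function"]
            result = tools[fn["name"]](**fn["arguments"])
            messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("agent did not finish within max_steps")

# Mock model: first requests the 'add' tool, then answers using the result.
def mock_model(messages):
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "",
                "tool_calls": [{"function": {"name": "add",
                                             "arguments": {"a": 2, "b": 3}}}]}
    return {"role": "assistant", "content": "2 + 3 = 5"}

print(run_agent(mock_model, {"add": lambda a, b: a + b}, "What is 2 + 3?"))
```

A model that can't produce the `tool_calls` structure reliably breaks this loop on the very first step, which is why chat quality alone is not enough.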
Top Community Recommendations
Tier 1: Best Performance
Carnice MoE 35B A3B
- Specifically tuned for Hermes tool use
- ~158 tokens/sec on RTX 5090
- Best choice if you have 24GB+ VRAM
- From the Discord: "Carnice series was specifically tuned for Hermes so if you have any issues with Qwen3.5 or it completes something 60% of the time I promise Carnice will do it."
Qwen 3.5 35B A3B
- Strong reasoning and tool calling
- ~60 tokens/sec on RTX 5090
- Good balance of speed and capability
- Q6_K_L quantization recommended
Nemotron 3 Super 120B
- Excellent for complex reasoning tasks
- Requires serious hardware (DGX Spark mentioned)
- Best thinking/reasoning model in the poll
- "nemo3super120b — I like how nemo thinks"
Tier 2: Solid Choices
Gemma 4 27B
- Works on M-series Macs (M5 Max tested)
- "Gemma4 is more rounded" but has known tool issues
- Good for general tasks, less reliable for complex tool chains
Qwen 9B / 35B
- Multiple users running successfully
- Good entry point for smaller setups
MiniMax M2.5 UD Q8
- Paired with Qwen 3.5 by several users
- Good for variety in model switching
Tier 3: Use With Caution
Gemma 4 (smaller variants)
- "I have been having a ton of tool issues with Gemma4"
- Works for chat, unreliable for agentic tasks
Very Small Models (7B and under)
- "4GB VRAM is very low"
- Most 7B models struggle with tool calling
- Not recommended for agentic use
Hardware Requirements
| Model | Min VRAM | Recommended | Speed |
|---|---|---|---|
| Carnice 35B A3B | 24GB | 32GB+ | ~160 tok/s (5090) |
| Qwen 3.5 35B | 24GB | 32GB+ | ~60 tok/s (5090) |
| Gemma 4 27B | 16GB | 24GB | Varies |
| Qwen 9B | 8GB | 12GB | Good on consumer GPUs |
| Nemotron 120B | 48GB+ | DGX | Slow but capable |
Model Configuration Tips
1. Set context length explicitly
```yaml
model:
  provider: ollama
  name: carnice-moe-35b-a3b
  contextLength: 32768
```
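Note that Ollama also enforces its own context window, which defaults to a much smaller value, so it can help to pin `num_ctx` on the Ollama side too. One way is a Modelfile (the base model tag below is a placeholder, not a verified Ollama tag):

```
# Illustrative Modelfile — replace the FROM tag with your actual model
FROM carnice-moe-35b-a3b
PARAMETER num_ctx 32768
```

Build it with `ollama create <name> -f Modelfile` and point Hermes at the new model name.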
2. Use quantization wisely
- Q6_K_L: Best quality, needs more VRAM
- Q4_K_M: Good balance for most users
3. Test tool calling first. Ask Hermes: "List all the tools you have access to and demonstrate using one."
If it responds like a basic chatbot, the model doesn't support tool calling properly.
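If you want to script that check rather than eyeball it, you can inspect the model's reply for a well-formed tool call. The message shape below (OpenAI/Ollama-style `tool_calls`) and the `used_tools` helper are assumptions for illustration:

```python
def used_tools(message: dict) -> bool:
    """True if the reply contains at least one well-formed tool call
    (assumes the OpenAI/Ollama-style message shape)."""
    for call in message.get("tool_calls") or []:
        fn = call.get("function", {})
        if fn.get("name"):
            return True
    return False

# A tool-capable model returns something like this...
agentic = {"role": "assistant", "content": "",
           "tool_calls": [{"function": {"name": "list_files", "arguments": {}}}]}
# ...while a chat-only model just answers in prose.
chatty = {"role": "assistant", "content": "I have access to many tools!"}

print(used_tools(agentic), used_tools(chatty))  # True False
```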
Community Setups
Prosumer Build (RTX 4090/5090)
- Carnice 35B A3B as primary
- Qwen 3.5 as backup
Mac M-Series
- Gemma 4 27B on M5 Max works
- Test tool calling — some users report issues
FAQ
Q: My local model just chats but doesn't use tools. Why?
A: The model likely doesn't support function/tool calling. Try Carnice, Qwen 3.5, or another Tier 1 model.
Q: What's the minimum VRAM for local Hermes?
A: 8GB can run Qwen 9B. For reliable agentic work, 16GB+ is recommended.
Data from Nous Research Discord community poll, April 2026.