
Token Overhead Explained


Exactly where your tokens go in every Hermes request — system prompt, tool definitions, memory, and how to cut the bloat significantly.


Token overhead is the hidden cost of running autonomous agents. Hermes burns tokens on system prompts, tool definitions, memory, and message history. Here is exactly where your tokens go and how to optimize.

What Burns Tokens

Every Hermes request includes:

| Component | Tokens | % |
| --- | --- | --- |
| Tool definitions (31 tools) | 8,759 | 46% |
| System prompt (SOUL.md + skills catalog) | 5,176 | 27% |
| Messages | ~5,000 avg | 27% |

Total baseline: ~19,000 tokens per request before you say anything.
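As a quick sanity check, the components in the table above sum to the stated baseline:

```shell
# Per-request baseline: tool definitions + system prompt + average messages
echo $((8759 + 5176 + 5000))   # prints 18935, i.e. ~19K tokens
```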

CLI vs Telegram Overhead

A critical finding from Discord community research:

| Access Method | Tokens/Request | Cost Difference |
| --- | --- | --- |
| CLI | 6,000-8,000 | Baseline |
| Telegram | 15,000-20,000 | 2-3x higher |

Why the gap: The gateway was loading development context files from the hermes-agent repo directory. Fixed in recent versions by starting from the user's home directory.

Still worth knowing: messaging gateways add overhead compared to raw CLI.

Token Cost Projections

Using a Reddit user's forensic analysis:

| Scenario | API Calls | Estimated Cost (Sonnet 4.5) |
| --- | --- | --- |
| Simple bug fix | 20 | ~$6 |
| Feature implementation | 100 | ~$34 |
| Large refactor | 500 | ~$187 |
| Full project build | 1,000 | ~$405 |

With Kimi K2.5: ~$3 for a feature implementation. With DeepSeek on cache hits: under $1.
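You can sketch your own projections from the ~19K-token baseline. A minimal estimator, assuming input tokens only (real costs run higher once output tokens and cache misses are included) and a placeholder price you should replace with your provider's current rate:

```shell
# Rough per-scenario input-cost estimator.
# The 19000 tokens/call figure comes from the baseline above;
# the $/M-token price is an assumption -- substitute your provider's rate.
estimate_cost() {
  calls=$1        # number of API calls
  price_per_m=$2  # dollars per million input tokens (assumed)
  awk -v c="$calls" -v p="$price_per_m" \
      'BEGIN { printf "$%.2f\n", c * 19000 / 1e6 * p }'
}

estimate_cost 100 3   # e.g. a feature implementation at an assumed $3/M input
```

Output tokens typically dominate for code-heavy tasks, which is why the table's figures are several times higher than an input-only estimate.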

Optimization Tips

1. Use Cheap Models for Routine Tasks

Reserve Claude/GPT-4 for complex reasoning. Use Kimi, MiniMax, DeepSeek for:

  • File organization tasks
  • Simple message responses
  • Cron job executions
  • Research lookups

2. Platform-Specific Toolsets

Do not load browser tools for a Telegram messaging session:

hermes config set terminal.toolset default  # no browser for messaging gateways

Saves ~1,300 tokens per request.

3. Lazy Skills Loading

The skills system already loads sparsely — skill descriptions only, full content on demand. But disable unused skill categories:

skills:
  disabled_categories: [gaming, fun]

4. Trim MEMORY.md

The full 2,200-character memory file is injected into every request, relevant or not. If you are over budget, consolidate entries.
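To see how much you are injecting, count the characters in your memory file. The path below is an assumption; point it at wherever your MEMORY.md actually lives:

```shell
# Character count of the memory file injected per request.
# ~/.hermes/MEMORY.md is an assumed location, not a documented path.
MEMORY_FILE="$HOME/.hermes/MEMORY.md"
if [ -f "$MEMORY_FILE" ]; then
  wc -c < "$MEMORY_FILE"
else
  echo "no MEMORY.md at $MEMORY_FILE"
fi
```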

5. Short Sessions

Long conversation history builds up in context. Start new sessions (hermes --fresh) for unrelated tasks.

Provider Caching Comparison

Not all providers cache equally:

| Provider | Cache Support | Notes |
| --- | --- | --- |
| Anthropic | Full | Claude shows cache markers |
| OpenRouter | Partial | Depends on upstream |
| DeepSeek | 90% off on cache hits | Real savings |
| Kimi K2.5 | 75% off | Native discount |
| Gemini | None | Full price every turn |
| GLM | None | Full price every turn |

Choosing a cache-friendly provider is the single biggest lever for reducing costs.
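The effect of a cache discount is easy to quantify: effective price = full price x (1 - hit_rate x discount). A sketch with a DeepSeek-style 90% discount and an assumed 80% cache-hit rate (your actual hit rate depends on how stable your prompts are):

```shell
# Effective input price under a cache discount.
# hit = assumed cache-hit rate; disc = provider's cache discount.
awk 'BEGIN {
  hit = 0.80; disc = 0.90
  printf "effective rate: %.2fx of full price\n", 1 - hit * disc
}'
```

At those assumptions you pay 28% of list price, versus 100% every turn on a non-caching provider.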

Real-World Community Numbers

  • WhatsApp group, 168 messages: ~84 API calls = ~1.6M input tokens
  • 21,000 tokens for a simple weather query (when the agent spawned a terminal)
  • "4 million tokens in 2 hours of light usage" — Reddit user who quit

The lesson: understand what triggers high-token operations (terminal spawns, browser automation, complex code execution).
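The WhatsApp figure above is consistent with the ~19K-token baseline, which suggests the overhead really is per-request and roughly constant:

```shell
# 1.6M input tokens across 84 API calls
echo $((1600000 / 84))   # prints 19047, tokens per call
```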

The Fix for Telegram Overhead

If you use Telegram or Discord gateway:

  1. Make sure Hermes starts from your home directory, not the repo directory
  2. Check your config: hermes config show
  3. Run hermes gateway restart after updates
  4. Consider CLI for token-intensive sessions



FAQ

Why are tool definitions so expensive? 31 tools with full schemas take ~8,700 tokens. This is a trade-off: more tools = more capability but higher overhead.
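That works out to a fairly modest per-tool cost; the total is high only because all 31 schemas ship with every request:

```shell
# Average schema cost per tool, from the FAQ numbers above
echo $((8759 / 31))   # prints 282, tokens per tool definition
```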

Does prompt caching help? Yes, where the provider supports it: Anthropic (Claude shows explicit cache markers), DeepSeek (90% off on cache hits), and Kimi K2.5 (75% off) all offer real savings, and some OpenRouter routes pass caching through from upstream. Gemini and GLM do not cache at all.

How do I know my token usage? Community-built dashboard: github.com/Bichev/hermes-dashboard


Frequently Asked Questions

Why does Hermes use 15–20K tokens per request on Telegram but only 6–8K via CLI?

This was a bug in older Hermes versions: the gateway started in the hermes-agent repository directory and loaded development context files. Updating to v0.6.0 and restarting the gateway fixes this; the gateway now starts from your home directory.

How can I reduce token overhead on every Hermes request?

Three proven optimizations: use platform-specific toolsets so browser tools don't load for Telegram/Discord sessions (~1.3K tokens saved/req), disable unused skill categories in config (~2.2K tokens saved/req), and keep MEMORY.md lean — a full 2,200-char memory injects into every request regardless of relevance.

Which LLM providers offer the best caching to reduce token costs?

DeepSeek offers 90% discount on cache hits — the single biggest cost lever. Kimi K2.5 offers 75% native discount. Anthropic Claude shows explicit cache markers. Gemini and GLM do not cache at all — you pay full price every turn.

What counts as a high-token operation I should be aware of?

Terminal tool spawning adds significant overhead. A simple weather query can reach 21,000 tokens if Hermes decides to use the terminal tool. Browser automation is expensive due to screenshot and vision analysis. Complex code execution with large file reads also stacks token costs quickly.

How do I monitor where my tokens are actually going?

Community member u/Witty_Ticket_4101 built a token forensics dashboard at github.com/Bichev/hermes-dashboard that breaks down token consumption by component. This is the most detailed available tool for understanding your actual Hermes API spend.
