Token overhead is the hidden cost of running autonomous agents. Hermes burns tokens on system prompts, tool definitions, memory, and message history. Here is exactly where your tokens go and how to optimize them.
What Burns Tokens
Every Hermes request includes:
| Component | Tokens | % |
|---|---|---|
| Tool definitions (31 tools) | 8,759 | 46% |
| System prompt (SOUL.md + skills catalog) | 5,176 | 27% |
| Messages | ~5,000 avg | 27% |
Total baseline: ~19,000 tokens per request before you say anything.
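The baseline is simple addition over the table above. A quick sketch (component figures from the table; exact counts depend on the tokenizer):

```python
# Per-request overhead before any user input (figures from the table above).
BASELINE = {
    "tool_definitions": 8_759,  # 31 tool schemas
    "system_prompt": 5_176,     # SOUL.md + skills catalog
    "messages": 5_000,          # average message history
}

total = sum(BASELINE.values())
print(total)  # 18935, i.e. ~19,000 tokens per request
```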
CLI vs Telegram Overhead
A critical finding from Discord community research:
| Access Method | Tokens/Request | Cost Difference |
|---|---|---|
| CLI | 6,000-8,000 | Baseline |
| Telegram | 15,000-20,000 | 2-3x higher |
Why the gap: The gateway was loading development context files from the hermes-agent repo directory. Fixed in recent versions by starting from the user's home directory.
Still worth knowing: messaging gateways add overhead compared to raw CLI.
Token Cost Projections
Based on a Reddit user's forensic analysis:
| Scenario | API Calls | Estimated Cost (Sonnet 4.5) |
|---|---|---|
| Simple bug fix | 20 | ~$6 |
| Feature implementation | 100 | ~$34 |
| Large refactor | 500 | ~$187 |
| Full project build | 1,000 | ~$405 |
With Kimi K2.5: ~$3 for a feature implementation. With DeepSeek on cache hits: under $1.
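A rough estimator behind numbers like these, assuming illustrative prices ($3/M input, $15/M output for a Sonnet-class model) and hypothetical per-call token averages; actual per-call counts vary widely:

```python
def estimate_cost(calls, in_tokens, out_tokens, in_price, out_price):
    """Estimated USD cost; prices are per million tokens."""
    per_call = (in_tokens * in_price + out_tokens * out_price) / 1e6
    return calls * per_call

# Hypothetical feature implementation: 100 calls averaging
# 80k input / 4k output tokens at assumed Sonnet-class prices.
print(round(estimate_cost(100, 80_000, 4_000, 3.0, 15.0), 2))  # 30.0
```

That lands in the same ballpark as the ~$34 figure in the table; the gap comes from per-call token averages, which only the original forensic analysis pins down.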
Optimization Tips
1. Use Cheap Models for Routine Tasks
Reserve Claude/GPT-4 for complex reasoning. Use Kimi, MiniMax, or DeepSeek for:
- File organization tasks
- Simple message responses
- Cron job executions
- Research lookups
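One way to apply this split is a simple routing table. A minimal sketch; the model names and task categories are illustrative, not Hermes config keys:

```python
# Hypothetical routing: cheap models for routine work,
# an expensive model only as the complex-reasoning fallback.
ROUTES = {
    "file_organization": "kimi-k2.5",
    "simple_reply": "deepseek-chat",
    "cron_job": "minimax",
    "research_lookup": "deepseek-chat",
}
DEFAULT = "claude-sonnet-4.5"  # anything not routinely cheap

def pick_model(task_type: str) -> str:
    return ROUTES.get(task_type, DEFAULT)

print(pick_model("cron_job"))        # minimax
print(pick_model("large_refactor"))  # claude-sonnet-4.5
```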
2. Platform-Specific Toolsets
Do not load browser tools for a Telegram messaging session:
hermes config set terminal.toolset default # no browser for messaging gateways
Saves ~1,300 tokens per request.
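Those ~1,300 tokens compound across requests. A back-of-envelope sketch, assuming a $3/M input price (a Sonnet-class assumption, not a quoted rate):

```python
SAVED_PER_REQUEST = 1_300   # tokens saved by dropping browser tools
PRICE_PER_M_INPUT = 3.0     # assumed USD per million input tokens
requests = 1_000

savings = requests * SAVED_PER_REQUEST * PRICE_PER_M_INPUT / 1e6
print(round(savings, 2))  # 3.9 USD per 1,000 requests
```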
3. Lazy Skills Loading
The skills system already loads sparsely — skill descriptions only, full content on demand. But disable unused skill categories:
skills:
  disabled_categories: [gaming, fun]
4. Trim MEMORY.md
The full 2,200-character MEMORY.md is injected into every request. If you are over budget, consolidate entries.
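To gauge what that costs, the common ~4 characters-per-token heuristic is good enough (approximate; real counts depend on the tokenizer):

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    return len(text) // 4

memory = "x" * 2_200  # stand-in for a full 2,200-char MEMORY.md
print(approx_tokens(memory))  # 550
```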
5. Short Sessions
Long conversation history builds up in context. Start new sessions (hermes --fresh) for unrelated tasks.
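Why this matters: the whole history is resent on every turn, so total billed input over a session grows roughly quadratically with its length. A sketch, assuming the ~19k fixed baseline and a hypothetical ~300 tokens appended to history per turn:

```python
BASELINE = 19_000   # fixed overhead per request (tools + system prompt)
PER_TURN = 300      # assumed tokens added to history each turn

def total_input(turns: int) -> int:
    # Turn i resends the baseline plus all history accumulated so far.
    return sum(BASELINE + i * PER_TURN for i in range(turns))

print(total_input(10))   # 203500  — ten short-session turns
print(total_input(100))  # 3385000 — one long 100-turn session
```

Under these assumptions, ten 10-turn sessions bill ~2.0M input tokens while a single 100-turn session bills ~3.4M, which is why splitting unrelated tasks into fresh sessions pays off.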
Provider Caching Comparison
Not all providers cache equally:
| Provider | Cache Support | Notes |
|---|---|---|
| Anthropic | Full | Claude shows cache markers |
| OpenRouter | Partial | Depends on upstream |
| DeepSeek | 90% off on cache | Real savings |
| Kimi K2.5 | 75% off | Native discount |
| Gemini | None | Full price every turn |
| GLM | None | Full price every turn |
Choosing a cache-friendly provider is the single biggest lever for reducing costs.
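The effect can be sketched as a blended price: hit_rate × discounted + (1 − hit_rate) × full. The 90% discount is from the table; the $0.27/M base price and 80% hit rate are assumptions for illustration:

```python
def effective_price(full_price, cache_discount, hit_rate):
    """Blended per-million input price given a cache-hit rate."""
    cached = full_price * (1 - cache_discount)
    return hit_rate * cached + (1 - hit_rate) * full_price

# DeepSeek-style 90% cache discount, assumed $0.27/M full price,
# assumed 80% of input tokens hitting the cache.
print(round(effective_price(0.27, 0.90, 0.80), 4))  # 0.0756
```

In this sketch the blended price is ~72% lower than full price, while a no-cache provider pays full price on every turn.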
Real-World Community Numbers
- WhatsApp group, 168 messages: ~84 API calls = ~1.6M input tokens
- 21,000 tokens for a simple weather query (when the agent spawned a terminal)
- "4 million tokens in 2 hours of light usage" — Reddit user who quit
The lesson: understand what triggers high-token operations (terminal spawns, browser automation, complex code execution).
The Fix for Telegram Overhead
If you use Telegram or Discord gateway:
- Make sure Hermes starts from your home directory, not the repo directory
- Check your config: hermes config show
- Run hermes gateway restart after updates
- Consider CLI for token-intensive sessions
Cost calculator | VPS guide | Compare to Cursor
FAQ
Why are tool definitions so expensive? 31 tools with full schemas take ~8,700 tokens. This is a trade-off: more tools = more capability but higher overhead.
Does prompt caching help? Yes, with providers that support it: Anthropic caches Claude requests, DeepSeek and Kimi apply cache-hit discounts, and some OpenRouter routes pass caching through. Gemini and GLM do not cache, so you pay full price every turn.
How do I know my token usage? Community-built dashboard: github.com/Bichev/hermes-dashboard