
Configure Context Compression — Handle Long Conversations

Set up automatic context compression so Hermes can handle long conversations without running out of room in the model's context window.

Quick answer

Context compression keeps Hermes conversations going past the model's context limit by automatically summarizing older turns to make room for new ones while preserving the important details. Enable and tune it in config.yaml to trade some fidelity on old turns for effectively unlimited conversation length and steadier token use.

Long conversations eventually hit the model's context limit. The steps below turn compression on and control when it starts, how much recent context it keeps, and which model writes the summaries.

Before you start:

☑ Hermes Agent installed
☑ Understanding of token limits and context windows

Steps

1. Enable compression
   In config.yaml, set compression: enabled: true.
2. Set the threshold
   Set compression: threshold: 0.50 to start compressing once the conversation fills 50% of the model's context limit.
3. Configure the target ratio
   Set compression: target_ratio: 0.20 to preserve 20% of the context as recent messages after compression.
4. Protect recent messages
   Set compression: protect_last_n: 20 to always keep the last 20 messages intact.
5. Choose a compression model (optional)
   Set auxiliary: compression: model: google/gemini-3-flash-preview to choose which model writes the summaries.
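Taken together, the five steps above produce a config.yaml section like the one below. This is a sketch that assumes the inline paths (compression: enabled: true and so on) map directly to nested YAML keys; compare it with the default config.yaml shipped with your Hermes version before copying it.

```yaml
# config.yaml: compression settings from steps 1-5 (nesting assumed from the inline paths)
compression:
  enabled: true        # step 1: turn automatic compression on
  threshold: 0.50      # step 2: start compressing at 50% of the context limit
  target_ratio: 0.20   # step 3: preserve 20% as recent context after compressing
  protect_last_n: 20   # step 4: never summarize the last 20 messages

auxiliary:
  compression:
    model: google/gemini-3-flash-preview   # step 5 (optional): model that writes the summaries
```

If config.yaml already has a compression or auxiliary section, merge these keys into it instead of adding a second copy; a repeated top-level key may be rejected or silently overridden by the YAML parser.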

Pro Tips

💡 Lower threshold values trigger compression earlier: use 0.50 for early, aggressive compression and 0.80 to compress late.
💡 The first 3 turns (system prompt, initial request, first response) are always protected.
💡 By default, compression uses a fast, inexpensive model (Gemini Flash), so its cost impact is small.
💡 Use /compress <focus> to trigger compression manually with a specific focus topic.
💡 If compression succeeds but recall is still wrong, use the memory troubleshooting decision tree to separate active context from durable memory, session history, profiles, and external providers.

Troubleshooting

❌ Agent forgets important earlier context

✅ Increase protect_last_n to keep more recent messages. Also ensure critical information is in MEMORY.md for persistent recall.
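A sketch of this fix, assuming the same nested key layout as the setup steps; 40 is an illustrative value, not a recommendation.

```yaml
compression:
  protect_last_n: 40   # illustrative: keep the last 40 messages intact instead of 20
```

Raising this value likely means fewer messages are eligible for summarizing, so each compression frees less space and may run more often.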

❌ Compression happens too often

✅ Increase threshold from 0.50 to 0.70 or 0.80 to delay compression until more of the context window is used.
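Because the threshold is a fraction of the context limit, the trigger point is the product of the two. On a hypothetical 200,000-token context window, 0.50 starts compressing at about 100,000 tokens, 0.70 at 140,000, and 0.80 at 160,000. The fragment assumes the nested key layout from the setup steps.

```yaml
compression:
  threshold: 0.80   # e.g. 0.80 × 200,000 = 160,000 tokens on a hypothetical 200K-token window
```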

❌ Compression summaries are low quality

✅ Change the compression model to a higher-quality option: auxiliary: compression: model: anthropic/claude-3-haiku
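The same change written as nested YAML, assuming the inline path auxiliary: compression: model maps to the keys below.

```yaml
auxiliary:
  compression:
    model: anthropic/claude-3-haiku   # replaces the default summarization model (Gemini Flash)
```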

❌ Context pressure warnings but no compression

✅ Check that compression: enabled: true is set. Also verify your model's context_length is correctly detected.

FAQ

What does context compression do?

It summarizes older conversation turns as you approach the model's context limit, freeing room for new messages while keeping the important context, so conversations neither stop abruptly nor silently lose the thread.

Does compression lose information?

To a degree. It trades fidelity on older turns for conversation length: summaries keep the gist, not every detail. Critical facts should also live in persistent memory (for example MEMORY.md), which survives compression; compression manages the live context window, not long-term storage.

How does it affect cost?

It keeps the working context from ballooning, which steadies token usage on long sessions. Pair it with session reset policies for tighter cost control.
