Here's something most AI agent marketing gets wrong: "self-improving" is used so loosely it has become meaningless. Every chatbot claims it learns. Almost none of them actually do.
Hermes Agent's Atropos RL integration is different, and not in a marketing way. Atropos is Nous Research's reinforcement learning framework, and it has been built into Hermes from the start. This post explains what that means for you as a user, what the benchmarks measure, and why the research team considers it a genuine differentiator.
What Reinforcement Learning Actually Means Here
Reinforcement learning, in the context of AI agents, means: the agent's behavior generates training data that improves future versions of the agent.
Most AI agents are static. They run, they respond, and the session ends; the next conversation starts from zero. Hermes with Atropos is different: every task execution is a data point. The agent's trajectory through a problem (what tools it used, what failed, what succeeded, what the outcome was) gets captured and stored.
Those trajectories can then be used to train better models. That's not a metaphor. That's the actual training pipeline.
Atropos: The Infrastructure
Atropos is Nous Research's internal RL framework. In Hermes, it's been made accessible to every user through a simple command:
hermes research run --task "fix_bugs" --output trajectories/
What this does:
- Runs the specified task autonomously — in this example, fixing bugs in a codebase
- Captures the full trajectory — every tool call, every decision, every outcome
- Stores it in the specified directory — as structured data ready for analysis or training
- Exports to ShareGPT format — compatible with standard fine-tuning pipelines
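The post doesn't publish the trajectory schema, but "every tool call, every decision, every outcome" stored as structured data suggests something like the following. This is a hypothetical sketch, not Hermes's documented format; every field name here is an assumption (the only hint elsewhere in this post is a top-level "trajectory" array of steps):

```python
import json

# Hypothetical captured session. All field names are illustrative
# assumptions, not Hermes's documented schema.
session = {
    "task": "fix_bugs",
    "outcome": "success",
    "trajectory": [
        {"step": 1, "tool": "shell", "input": "pytest -x", "result": "1 failed"},
        {"step": 2, "tool": "editor", "input": "patch tests/test_utils.py", "result": "ok"},
        {"step": 3, "tool": "shell", "input": "pytest -x", "result": "all passed"},
    ],
}

# "Stored as structured data" in practice means JSON on disk.
print(json.dumps(session, indent=2))
```

The useful property of a record like this is that it captures behavior, not just text: the failed first attempt is as much training signal as the eventual success.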
For researchers, this is the path from "using Hermes" to "training on Hermes data." For everyday users, nothing changes in the foreground, but the trajectories researchers collect and share are what make future Hermes versions better at workflows like yours.
TerminalBench and TAU2: What the Benchmarks Measure
Two benchmarks are used to evaluate how well models perform as agents:
TerminalBench (v2) measures tool-use capability in terminal environments — how well a model executes multi-step tasks using command-line tools.
TAU2 (τ²-bench, the Tool-Agent-User benchmark) measures how well models handle agentic workflows: planning, tool selection, error recovery, and completion.
These are harder to game than standard LLM benchmarks because they measure behavior, not just text prediction accuracy.
Current Scores
| Model | Score | Benchmark | Notes |
|---|---|---|---|
| GLM | >99% | TAU2 | Used in Nous Research internal training |
| Qwen 3.5 35B A3B | 81.2% | TerminalBench | Strong agentic performance |
| Qwen 3.5 27B | 79% | TAU2 | Community favorite for local deployment |
| Gemma 4 31B | 76.9% | TAU2 | Good reasoning capability |
These numbers matter for users because:
- Model selection is informed by agentic performance, not just general benchmark scores
- High TAU2/TerminalBench scores correlate with actual task completion rates in Hermes
- The community uses these to recommend models — Qwen 3.5 27B scores 79% on TAU2 and is widely recommended as a local model option
The Flywheel: How It Works End-to-End
Here's the complete loop:
Step 1 — You run a task: "Research our top 5 competitors and save findings to ~/research/"
Step 2 — Hermes executes: Browser opens, tabs navigate, screenshots taken, data extracted, file written. 23 tool calls, 2 corrections from you, 1 failed approach that was abandoned.
Step 3 — Atropos captures: The full trajectory is logged. What worked. What didn't. What the task was. What the outcome was.
Step 4 — Trajectories accumulate: After 1,000 tasks across the Hermes user base, you have a substantial dataset of real agentic behavior — successful workflows, failed attempts, corrections.
Step 5 — Training data: Trajectories export to ShareGPT format. Research teams (including Nous Research) use this to fine-tune future models.
Step 6 — Better agents: Fine-tuned models perform better at agentic tasks. Hermes runs better. The loop feeds itself.
This is why the Hermes team says users are "training the next generation of tool-calling models" just by running the agent normally.
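On the data side, the flywheel reduces to a filter-and-export loop: accumulate trajectories, keep the instructive ones, and flatten them into fine-tuning examples. Here is a minimal sketch under stated assumptions; the filtering heuristic, the field names, and the turn layout are all illustrative, not Hermes internals:

```python
# Sketch of the flywheel's data side. Field names ("outcome",
# "trajectory") and the keep-failures-long-enough-to-teach-recovery
# heuristic are assumptions for illustration.

def select_for_training(trajectories):
    """Keep successful runs, plus long failures that may teach error recovery."""
    return [
        t for t in trajectories
        if t["outcome"] == "success" or len(t["trajectory"]) >= 5
    ]

def to_sharegpt(trajectory):
    """Flatten one trajectory into a ShareGPT-style conversation."""
    turns = [{"from": "human", "value": trajectory["task"]}]
    for step in trajectory["trajectory"]:
        turns.append({"from": "gpt", "value": f"[{step['tool']}] {step['input']}"})
        turns.append({"from": "tool", "value": step["result"]})
    return {"conversations": turns}

runs = [
    {"task": "fix_bugs", "outcome": "success",
     "trajectory": [{"tool": "shell", "input": "pytest", "result": "pass"}]},
    {"task": "refactor", "outcome": "failure",
     "trajectory": [{"tool": "shell", "input": "ls", "result": "ok"}]},
]
dataset = [to_sharegpt(t) for t in select_for_training(runs)]
print(len(dataset))  # the short failed run is filtered out
```

The design point is the split: selection decides what counts as signal, and the export format decides how a trainer sees it. Both are tunable independently of the agent itself.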
For Researchers: How to Use the RL Pipeline
If you want to train on Hermes trajectory data:
Collect trajectories:
hermes research run --task "write_tests" --output ./trajectories/write_tests/
hermes research run --task "debug_flaky" --output ./trajectories/debug/
Review what's been captured:
ls ./trajectories/
jq '.trajectory | length' ./trajectories/write_tests/session_001.json
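The same review can be scripted. The sketch below summarizes step counts across all captured sessions; it assumes each session file carries the top-level "trajectory" array that the jq query above reads, and the `session_*.json` naming follows the example filename, not a documented convention:

```python
import json
from pathlib import Path

def summarize(trajectory_dir):
    """Map each session file to its step count.

    Assumes each session_*.json has a top-level "trajectory" array,
    as implied by the jq query above; the glob pattern mirrors the
    example filename and is an assumption.
    """
    counts = {}
    for path in sorted(Path(trajectory_dir).glob("*/session_*.json")):
        with open(path) as f:
            session = json.load(f)
        counts[str(path)] = len(session.get("trajectory", []))
    return counts

for path, steps in summarize("./trajectories").items():
    print(f"{path}: {steps} steps")
```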
Export for fine-tuning:
hermes research export --format sharegpt --input ./trajectories/ --output ./training_data/
The ShareGPT format is compatible with most fine-tuning frameworks (Axolotl, LLaMA Factory, etc.).
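ShareGPT itself is a simple structure: a JSON record holding a "conversations" list of turns, each with a "from" role and a "value" string. A cheap shape check before handing exported data to a trainer can catch malformed records early (the role names below follow the common convention used by frameworks like Axolotl; check your framework's docs for the roles it accepts):

```python
# Minimal shape check for a ShareGPT-style record before training.
# Role names ("human", "gpt") follow the common community convention;
# individual fine-tuning frameworks may expect different role sets.

def is_sharegpt_record(record):
    convs = record.get("conversations")
    if not isinstance(convs, list) or not convs:
        return False
    return all(
        isinstance(turn, dict) and "from" in turn and "value" in turn
        for turn in convs
    )

example = {
    "conversations": [
        {"from": "human", "value": "Fix the failing test in utils.py"},
        {"from": "gpt", "value": "Running pytest to locate the failure..."},
    ]
}
assert is_sharegpt_record(example)
```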
For Regular Users: Why This Matters Without Doing Anything
You don't need to run a single research command to benefit from Atropos integration:
Better model recommendations: TerminalBench/TAU2 scores inform which models the community recommends. You're using better models as a result.
Skills improve over time: The Atropos loop is partly what drives the skill improvement system. Skills created from task executions are informed by trajectory analysis.
Future Hermes versions get better: Trajectory data contributed through the research pipeline feeds the training pool, and future releases inherit learnings from real workflows like yours.
Atropos is transparent: Unlike proprietary RL systems, Atropos is documented and the trajectory data is yours. You can inspect what was captured and what wasn't.
What Atropos Doesn't Do
Important to be clear about the boundaries:
Atropos doesn't modify your running Hermes. Your agent doesn't suddenly get smarter mid-session. The improvement happens in future training runs, not in real-time.
Trajectories are stored locally by default. Your task data doesn't automatically upload anywhere. Research commands save to your local directory.
RL training requires technical setup. Running fine-tuning jobs is not trivial. This is research infrastructure, not a one-click button.
Better models still matter most. A 79% TAU2 score is meaningful, but the underlying model's general capability still sets the ceiling. The RL pipeline optimizes behavior within that ceiling.
The Privacy Angle
When you run:
hermes research run --task "fix_bugs" --output trajectories/
The trajectory file is written to your local trajectories/ directory. It contains your conversation, the agent's tool calls, and the outcome. If you don't share it, nobody sees it.
When you run normal Hermes sessions (without the research command), no trajectory data is exported or shared. The Atropos integration is opt-in per session.
FAQ
Do I need to run RL commands to use Hermes?
No. The research commands are entirely optional. Normal Hermes usage benefits from the Atropos infrastructure without any explicit research setup.
Can I see what trajectories are captured?
Yes — run hermes research run --inspect to see what would be captured without actually saving it.
How is this different from OpenAI's RLHF or Anthropic's Constitutional AI?
Those systems train the base model globally — everyone's data contributes to a shared model. Atropos in Hermes trains on trajectories from your usage — the improvement is specific to the agentic behaviors you're exercising. You also own your trajectory data.
What models benefit most from Hermes trajectory training?
Models already strong at tool use and reasoning (Qwen 3.5, GLM, Gemma) benefit most. The RL pipeline optimizes agentic behavior, not general knowledge.