Nous Research · Hermes Agent

Atropos RL Integration

hermes agent · reinforcement learning · rl · atropos · training

What Atropos RL integration really means for everyday Hermes users: a technical explainer, no machine learning PhD required.

Want to try Hermes Agent yourself?

Try Hermes Free → Deploy in 60 seconds

Here's something most AI agent marketing gets wrong: "self-improving" is used so loosely it has become meaningless. Every chatbot claims it learns. Almost none of them actually do.

Hermes Agent's Atropos RL integration is different — and not in a marketing way. Atropos is Nous Research's reinforcement learning framework, and it's been built into Hermes from the start. This post explains what that means for you as a user, what the benchmarks measure, and why the research team is excited about it.

What Reinforcement Learning Actually Means Here

Reinforcement learning, in the context of AI agents, means: the agent's behavior generates training data that improves future versions of the agent.

Most AI agents are static. They run, they respond, and the session ends. The next conversation starts from zero. Hermes with Atropos is different: every task execution is a data point. The agent's trajectory through a problem — what tools it used, what failed, what succeeded, what the outcome was — gets captured and stored.

Those trajectories can then be used to train better models. That's not a metaphor. That's the actual training pipeline.
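The post doesn't publish the exact schema, but conceptually a captured trajectory is just structured data. A minimal sketch, with field names that are illustrative rather than the actual Hermes/Atropos format:

```python
# Hypothetical trajectory record -- field names are illustrative,
# not the documented Hermes/Atropos schema.
trajectory = {
    "task": "fix_bugs",
    "steps": [
        {"tool": "shell", "input": "pytest -x", "ok": False},
        {"tool": "editor", "input": "patch test_utils.py", "ok": True},
        {"tool": "shell", "input": "pytest -x", "ok": True},
    ],
    "outcome": "success",
}

# A training pipeline can filter on outcome and keep only the steps
# that worked, which is what makes failures as useful as successes.
successful_steps = [s for s in trajectory["steps"] if s["ok"]]
print(len(successful_steps))  # 2
```

The point is that the whole run, including the failed first attempt, survives as data rather than vanishing when the session closes.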

Atropos: The Infrastructure

Atropos is Nous Research's internal RL framework. In Hermes, it's been made accessible to every user through a simple command:

hermes research run --task "fix_bugs" --output trajectories/

What this does:

  1. Runs the specified task autonomously — in this example, fixing bugs in a codebase
  2. Captures the full trajectory — every tool call, every decision, every outcome
  3. Stores it in the specified directory — as structured data ready for analysis or training
  4. Exports to ShareGPT format — compatible with standard fine-tuning pipelines

For researchers, this is the path from "using Hermes" to "training on Hermes data." For users, it's mostly happening in the background — but the data being generated is what makes future Hermes versions better at your specific workflows.

TerminalBench and TAU2: What the Benchmarks Measure

Two benchmarks are used to evaluate how well models perform as agents:

TerminalBench (v2) measures tool-use capability in terminal environments — how well a model executes multi-step tasks using command-line tools.

TAU2 (τ²-bench, from Tool-Agent-User) measures how well models handle agentic workflows: planning, tool selection, error recovery, and task completion.

These are harder to game than standard LLM benchmarks because they measure behavior, not just text prediction accuracy.

Current Scores

| Model | Score | Benchmark | Notes |
| --- | --- | --- | --- |
| GLM | >99% | TAU2 | Used in Nous Research internal training |
| Qwen 3.5 35B A3B | 81.2% | TerminalBench | Strong agentic performance |
| Qwen 3.5 27B | 79% | TAU2 | Community favorite for local deployment |
| Gemma 4 31B | 76.9% | TAU2 | Good reasoning capability |

These numbers matter for users because:

  1. Model selection is informed by agentic performance, not just general benchmark scores
  2. High TAU2/TerminalBench scores correlate with actual task completion rates in Hermes
  3. The community uses these to recommend models — Qwen 3.5 27B scores 79% on TAU2 and is widely recommended as a local model option

The Flywheel: How It Works End-to-End

Here's the complete loop:

Step 1 — You run a task: "Research our top 5 competitors and save findings to ~/research/"

Step 2 — Hermes executes: Browser opens, tabs navigate, screenshots taken, data extracted, file written. 23 tool calls, 2 corrections from you, 1 failed approach that was abandoned.

Step 3 — Atropos captures: The full trajectory is logged. What worked. What didn't. What the task was. What the outcome was.

Step 4 — Trajectories accumulate: After 1,000 tasks across the Hermes user base, you have a massive dataset of real agentic behavior — successful workflows, failed attempts, corrections.

Step 5 — Training data: Trajectories export to ShareGPT format. Research teams (including Nous Research) use this to fine-tune future models.

Step 6 — Better agents: Fine-tuned models perform better at agentic tasks. Hermes runs better. The loop feeds itself.
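Step 4 is easy to picture concretely. Assuming each stored trajectory file carries top-level "task" and "outcome" fields (an illustrative schema, not the documented one), accumulating per-task success rates across a directory of runs might look like:

```python
import json
from collections import defaultdict
from pathlib import Path

def success_rates(trajectory_dir: str) -> dict:
    """Aggregate per-task success rates from stored trajectory JSON files.

    Assumes each file has top-level "task" and "outcome" fields --
    a hypothetical schema for illustration, not the real Hermes format.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for path in Path(trajectory_dir).glob("**/*.json"):
        record = json.loads(path.read_text())
        totals[record["task"]] += 1
        if record["outcome"] == "success":
            wins[record["task"]] += 1
    return {task: wins[task] / n for task, n in totals.items()}
```

This is the kind of aggregate view that separates "a pile of logs" from "a dataset of real agentic behavior."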

This is why the Hermes team says users are "training the next generation of tool-calling models" just by running the agent normally.

For Researchers: How to Use the RL Pipeline

If you want to train on Hermes trajectory data:

Collect trajectories:

hermes research run --task "write_tests" --output ./trajectories/write_tests/
hermes research run --task "debug_flaky" --output ./trajectories/debug/

Review what's been captured:

ls ./trajectories/
cat ./trajectories/write_tests/session_001.json | jq '.trajectory | length'
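If you'd rather do that inspection in Python than with jq, the same count works in a few lines (assuming, as the jq filter implies, a top-level "trajectory" array in each session file):

```python
import json

def trajectory_length(path: str) -> int:
    """Count steps in a stored session file.

    Mirrors the shell pipeline: jq '.trajectory | length'
    Assumes the file has a top-level "trajectory" array.
    """
    with open(path) as f:
        return len(json.load(f)["trajectory"])
```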

Export for fine-tuning:

hermes research export --format sharegpt --input ./trajectories/ --output ./training_data/

The ShareGPT format is compatible with most fine-tuning frameworks (Axolotl, LLaMA Factory, etc.).
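ShareGPT's conversation shape is well established: a list of records, each holding a "conversations" array of {"from", "value"} turns. A minimal converter from the hypothetical trajectory schema above to that shape could look like this (the trajectory field names are assumptions; the output shape is the standard ShareGPT one these frameworks accept):

```python
def to_sharegpt(trajectory: dict) -> dict:
    """Convert a trajectory into a ShareGPT-style record.

    Input schema ("task" string plus "steps" with "tool_call" and
    "result" fields) is hypothetical. The output shape,
    {"conversations": [{"from": ..., "value": ...}]}, is the
    standard ShareGPT format read by Axolotl and LLaMA Factory.
    """
    conversations = [{"from": "human", "value": trajectory["task"]}]
    for step in trajectory["steps"]:
        conversations.append({"from": "gpt", "value": step["tool_call"]})
        conversations.append({"from": "human", "value": step["result"]})
    return {"conversations": conversations}
```

Tool results are fed back as "human" turns here; real pipelines may use a dedicated tool role, depending on the fine-tuning framework's chat template.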

For Regular Users: Why This Matters Without Doing Anything

You don't need to run a single research command to benefit from Atropos integration:

  • Better model recommendations: TerminalBench/TAU2 scores inform which models the community recommends. You're using better models as a result.

  • Skills improve over time: The Atropos loop is partly what drives the skill improvement system. Skills created from task executions are informed by trajectory analysis.

  • Future Hermes versions get better: Every session you run contributes to the training data pool. Future releases inherit learnings from your actual workflows.

  • Atropos is transparent: Unlike proprietary RL systems, Atropos is documented and the trajectory data is yours. You can inspect what was captured and what wasn't.

What Atropos Doesn't Do

Important to be clear about the boundaries:

  • Atropos doesn't modify your running Hermes. Your agent doesn't suddenly get smarter mid-session. The improvement happens in future training runs, not in real-time.

  • Trajectories are stored locally by default. Your task data doesn't automatically upload anywhere. Research commands save to your local directory.

  • RL training requires technical setup. Running fine-tuning jobs is not trivial. This is research infrastructure, not a one-click button.

  • Better models still matter most. A 79% TAU2 score is meaningful, but the underlying model's general capability still sets the ceiling. The RL pipeline optimizes behavior within that ceiling.

The Privacy Angle

When you run:

hermes research run --task "fix_bugs" --output trajectories/

The trajectory file is written to your local trajectories/ directory. It contains your conversation, the agent's tool calls, and the outcome. If you don't share it, nobody sees it.

When you run normal Hermes sessions (without the research command), no trajectory data is exported or shared. Trajectory export is opt-in per session.

FAQ

Do I need to run RL commands to use Hermes?

No. The research commands are entirely optional. Normal Hermes usage benefits from the Atropos infrastructure without any explicit research setup.

Can I see what trajectories are captured?

Yes — run `hermes research run --inspect` to see what would be captured without actually saving it.

How is this different from OpenAI's RLHF or Claude's Constitutional AI?

Those systems train the base model globally — everyone's data contributes to a shared model. Atropos in Hermes trains on trajectories from your usage — the improvement is specific to the agentic behaviors you're exercising. You also own your trajectory data.

What models benefit most from Hermes trajectory training?

Models already strong at tool use and reasoning (Qwen 3.5, GLM, Gemma) benefit most. The RL pipeline optimizes agentic behavior, not general knowledge.

Self-Improving Guide | Skills Guide | Memory System

flyhermes.ai


Ready to Run Your Own AI Agent?

Self-host Hermes in 60 seconds. No credit card, no cloud lock-in.

Deploy Hermes Free →
