Reinforcement Learning — Train Hermes on Your Feedback
Key Points
- ✓ Atropos RL pipeline built-in
- ✓ Learn from user feedback
- ✓ RLHF and DPO support
- ✓ Custom reward functions
- ✓ Improve task-specific performance
- ✓ Works with local and cloud models
How It Works
1. Hermes collects interaction data
2. Rate responses or let auto-evaluation run
3. Atropos pipeline trains on feedback
4. Model improves on your specific use cases
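The four steps above can be sketched as a minimal feedback loop. This is an illustrative data-flow sketch only; the names (`Trajectory`, `collect`, `rate`, `training_batch`) are hypothetical and not the actual Hermes or Atropos API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trajectory:
    """One interaction: observations, tool calls, responses, plus a rating."""
    steps: list
    rating: Optional[float] = None  # +1 good, -1 bad, None = unrated

def collect(buffer: list, steps: list) -> Trajectory:
    # Step 1: Hermes records the session as structured trajectory data.
    traj = Trajectory(steps=steps)
    buffer.append(traj)
    return traj

def rate(traj: Trajectory, score: float) -> None:
    # Step 2: a human rating or auto-evaluation attaches a score.
    traj.rating = score

def training_batch(buffer: list) -> list:
    # Step 3: only rated trajectories are handed to the training pipeline.
    return [t for t in buffer if t.rating is not None]

buffer: list = []
t = collect(buffer, ["user request", "tool call", "response"])
rate(t, +1.0)
print(len(training_batch(buffer)))  # 1
```

Step 4 (the model improving) happens inside the training pipeline itself, which consumes batches like the one produced here.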
Real-World Use Cases
Task-Specific Fine-Tuning
If Hermes consistently makes a specific type of error on your domain tasks — misformatting outputs, missing domain conventions, using the wrong tool — rate those interactions negatively. The Atropos pipeline uses that signal to steer the model away from those patterns.
Custom Reward Functions
Define what 'good' means for your use case programmatically: code that passes tests, documents under a word limit, API calls that return 200. Write a reward function; the RL pipeline optimizes against it automatically.
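As a concrete sketch of the idea, here are reward functions for two of the criteria mentioned (a word limit and an HTTP 200), combined into one scalar score. The function names and the 50/50 weighting are illustrative assumptions, not the Atropos registration API.

```python
def word_limit_reward(text: str, limit: int = 500) -> float:
    """1.0 if the document is under the word limit, with a linear
    penalty for overshoot, clamped to [0, 1]."""
    words = len(text.split())
    if words <= limit:
        return 1.0
    return max(0.0, 1.0 - (words - limit) / limit)

def status_reward(status_code: int) -> float:
    """Binary reward: did the API call the agent made return 200?"""
    return 1.0 if status_code == 200 else 0.0

def combined_reward(text: str, status_code: int) -> float:
    # Equal weighting is arbitrary; tune per use case.
    return 0.5 * word_limit_reward(text) + 0.5 * status_reward(status_code)

print(combined_reward("a short summary", 200))  # 1.0
```

The RL pipeline then maximizes this scalar, so anything you can score programmatically (test pass rates, linter output, latency budgets) can drive training the same way.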
ShareGPT Export for Custom Models
Export your interaction trajectories in ShareGPT format for fine-tuning your own models. If you use Hermes heavily in a specialized domain, the trajectory dataset captures expert-level task execution that can train a domain-specific model.
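A minimal converter shows what the exported shape looks like. The `conversations`/`from`/`value` fields follow the common ShareGPT convention; the `to_sharegpt` helper and the role names in `ROLE_MAP` are illustrative, so verify the exact schema against your fine-tuning tooling.

```python
import json

# Map internal roles to ShareGPT speaker tags (assumed convention).
ROLE_MAP = {"system": "system", "user": "human", "assistant": "gpt"}

def to_sharegpt(turns):
    """Convert a list of (role, text) turns into one ShareGPT record."""
    return {
        "conversations": [
            {"from": ROLE_MAP[role], "value": text} for role, text in turns
        ]
    }

record = to_sharegpt([
    ("user", "Summarize the log file."),
    ("assistant", "The log shows three failed requests."),
])
print(json.dumps(record, indent=2))
```

One record per trajectory, written as JSON lines, is the usual way such datasets are fed to fine-tuning frameworks.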
Research Trajectory Generation
Generate large batch trajectory datasets with parallel workers and checkpointing. Hermes is designed as a research platform for training the next generation of tool-calling models; the RL infrastructure is not bolted on, it's built in.
Under the Hood
The Atropos integration in Hermes implements the full RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) pipelines. In RLHF mode, Hermes collects interaction trajectories — full sequences of observations, tool calls, and responses — and presents them for human rating. Ratings are used to train a reward model, which then guides the policy model via PPO or similar on-policy algorithms. In DPO mode, pairs of preferred/rejected trajectories are collected and used directly for offline optimization, which is more sample-efficient for smaller iteration cycles.
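The DPO objective described above can be written out directly. This is the standard DPO loss in plain Python for a single preference pair, not Hermes-specific code: the inputs are trajectory log-probabilities under the current policy and the frozen reference model, and `beta` controls how far the policy may drift from the reference.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Negative log-sigmoid of the scaled implicit-reward margin
    between the preferred and rejected trajectory."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The loss falls as the policy assigns relatively more probability
# to the preferred trajectory than the reference model does.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Because the loss needs only log-probabilities from stored trajectory pairs, no reward model and no on-policy rollouts are required, which is why DPO suits smaller iteration cycles.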
The trajectory collection layer is always active: every Hermes session generates structured trajectory data stored in the SQLite database. Batch trajectory generation mode runs the agent headlessly against a task set with parallel workers (configurable concurrency) and checkpointing, so large-scale data collection doesn't require babysitting. Trajectory compression reduces storage overhead for long sessions with many tool calls without losing the structural information needed for training.
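The storage pattern described above (trajectories in SQLite plus a checkpoint so batch runs can resume) can be sketched as follows. The schema and `save` helper are hypothetical; Hermes's actual tables will differ.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE trajectories (
    id    INTEGER PRIMARY KEY,
    task  TEXT NOT NULL,
    steps TEXT NOT NULL      -- JSON-encoded observations/tool calls
);
CREATE TABLE checkpoint (
    worker    INTEGER PRIMARY KEY,
    last_task INTEGER NOT NULL
);
""")

def save(task: str, steps: list, worker: int, task_idx: int) -> None:
    # One transaction: the trajectory and the checkpoint advance together,
    # so a crash never leaves a recorded trajectory without a resume point.
    with db:
        db.execute("INSERT INTO trajectories (task, steps) VALUES (?, ?)",
                   (task, json.dumps(steps)))
        db.execute("INSERT OR REPLACE INTO checkpoint VALUES (?, ?)",
                   (worker, task_idx))

save("summarize_logs", [{"tool": "read_file"}], worker=0, task_idx=7)
row = db.execute("SELECT last_task FROM checkpoint WHERE worker = 0").fetchone()
print(row[0])  # 7
```

On restart, each worker reads its `last_task` and continues from there, which is what lets large batch runs proceed without babysitting.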
For teams and researchers, Hermes's RL infrastructure provides a complete loop: deploy the agent on real tasks, collect trajectories, filter and rate the interesting ones, run the RL pipeline, update the model, redeploy. This loop works with both cloud models (using the API) and local models via Ollama or vLLM — the RL pipeline is model-agnostic by design, matching the rest of the Hermes architecture.