Reinforcement Learning — Train Hermes on Your Feedback
Key Points
- ✓ Atropos RL pipeline built-in
- ✓ Learn from user feedback
- ✓ RLHF and DPO support
- ✓ Custom reward functions
- ✓ Improve task-specific performance
- ✓ Works with local and cloud models
How It Works
1. Hermes collects interaction data
2. Rate responses or let auto-evaluation run
3. Atropos pipeline trains on feedback
4. Model improves on your specific use cases
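The four steps above can be sketched as a minimal feedback loop. This is an illustrative data-flow sketch only; the names (`Trajectory`, `collect`, `rate`, `training_batch`) are hypothetical and not the actual Hermes or Atropos API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trajectory:
    """One interaction: observations, tool calls, responses, plus a rating."""
    steps: list
    rating: Optional[float] = None  # +1 good, -1 bad, None = unrated

def collect(buffer: list, steps: list) -> Trajectory:
    # Step 1: Hermes records the session as structured trajectory data.
    traj = Trajectory(steps=steps)
    buffer.append(traj)
    return traj

def rate(traj: Trajectory, score: float) -> None:
    # Step 2: a human rating or auto-evaluation attaches a score.
    traj.rating = score

def training_batch(buffer: list) -> list:
    # Step 3: only rated trajectories are handed to the training pipeline.
    return [t for t in buffer if t.rating is not None]

buffer: list = []
t = collect(buffer, ["user request", "tool call", "response"])
rate(t, +1.0)
print(len(training_batch(buffer)))  # 1
```

Step 4 (the model improving) happens inside the training pipeline itself, which consumes batches like the one produced here.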
Real-World Use Cases
Task-Specific Fine-Tuning
If Hermes consistently makes a specific type of error on your domain tasks — misformatting outputs, missing domain conventions, using the wrong tool — rate those interactions negatively. The Atropos pipeline uses that signal to steer the model away from those patterns.
Custom Reward Functions
Define what 'good' means for your use case programmatically: code that passes tests, documents under a word limit, API calls that return 200. Write a reward function; the RL pipeline optimizes against it automatically.
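As a concrete sketch of the idea, here are reward functions for two of the criteria mentioned (a word limit and an HTTP 200), combined into one scalar score. The function names and the 50/50 weighting are illustrative assumptions, not the Atropos registration API.

```python
def word_limit_reward(text: str, limit: int = 500) -> float:
    """1.0 if the document is under the word limit, with a linear
    penalty for overshoot, clamped to [0, 1]."""
    words = len(text.split())
    if words <= limit:
        return 1.0
    return max(0.0, 1.0 - (words - limit) / limit)

def status_reward(status_code: int) -> float:
    """Binary reward: did the API call the agent made return 200?"""
    return 1.0 if status_code == 200 else 0.0

def combined_reward(text: str, status_code: int) -> float:
    # Equal weighting is arbitrary; tune per use case.
    return 0.5 * word_limit_reward(text) + 0.5 * status_reward(status_code)

print(combined_reward("a short summary", 200))  # 1.0
```

The RL pipeline then maximizes this scalar, so anything you can score programmatically (test pass rates, linter output, latency budgets) can drive training the same way.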
ShareGPT Export for Custom Models
Export your interaction trajectories in ShareGPT format for fine-tuning your own models. If you use Hermes heavily in a specialized domain, the trajectory dataset captures expert-level task execution that can train a domain-specific model.
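A minimal converter shows what the exported shape looks like. The `conversations`/`from`/`value` fields follow the common ShareGPT convention; the `to_sharegpt` helper and the role names in `ROLE_MAP` are illustrative, so verify the exact schema against your fine-tuning tooling.

```python
import json

# Map internal roles to ShareGPT speaker tags (assumed convention).
ROLE_MAP = {"system": "system", "user": "human", "assistant": "gpt"}

def to_sharegpt(turns):
    """Convert a list of (role, text) turns into one ShareGPT record."""
    return {
        "conversations": [
            {"from": ROLE_MAP[role], "value": text} for role, text in turns
        ]
    }

record = to_sharegpt([
    ("user", "Summarize the log file."),
    ("assistant", "The log shows three failed requests."),
])
print(json.dumps(record, indent=2))
```

One record per trajectory, written as JSON lines, is the usual way such datasets are fed to fine-tuning frameworks.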
Research Trajectory Generation
Generate large batch trajectory datasets with parallel workers and checkpointing. Hermes is designed as a research platform for training the next generation of tool-calling models; the RL infrastructure is not bolted on, it's built in.
Under the Hood
The Atropos integration in Hermes implements the full RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) pipelines. In RLHF mode, Hermes collects interaction trajectories — full sequences of observations, tool calls, and responses — and presents them for human rating. Ratings are used to train a reward model, which then guides the policy model via PPO or similar on-policy algorithms. In DPO mode, pairs of preferred/rejected trajectories are collected and used directly for offline optimization, which is more sample-efficient for smaller iteration cycles.
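The DPO objective described above can be written out directly. This is the standard DPO loss in plain Python for a single preference pair, not Hermes-specific code: the inputs are trajectory log-probabilities under the current policy and the frozen reference model, and `beta` controls how far the policy may drift from the reference.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Negative log-sigmoid of the scaled implicit-reward margin
    between the preferred and rejected trajectory."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The loss falls as the policy assigns relatively more probability
# to the preferred trajectory than the reference model does.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Because the loss needs only log-probabilities from stored trajectory pairs, no reward model and no on-policy rollouts are required, which is why DPO suits smaller iteration cycles.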
The trajectory collection layer is always active: every Hermes session generates structured trajectory data stored in the SQLite database. Batch trajectory generation mode runs the agent headlessly against a task set with parallel workers (configurable concurrency) and checkpointing, so large-scale data collection doesn't require babysitting. Trajectory compression reduces storage overhead for long sessions with many tool calls without losing the structural information needed for training.
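The storage pattern described above (trajectories in SQLite plus a checkpoint so batch runs can resume) can be sketched as follows. The schema and `save` helper are hypothetical; Hermes's actual tables will differ.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE trajectories (
    id    INTEGER PRIMARY KEY,
    task  TEXT NOT NULL,
    steps TEXT NOT NULL      -- JSON-encoded observations/tool calls
);
CREATE TABLE checkpoint (
    worker    INTEGER PRIMARY KEY,
    last_task INTEGER NOT NULL
);
""")

def save(task: str, steps: list, worker: int, task_idx: int) -> None:
    # One transaction: the trajectory and the checkpoint advance together,
    # so a crash never leaves a recorded trajectory without a resume point.
    with db:
        db.execute("INSERT INTO trajectories (task, steps) VALUES (?, ?)",
                   (task, json.dumps(steps)))
        db.execute("INSERT OR REPLACE INTO checkpoint VALUES (?, ?)",
                   (worker, task_idx))

save("summarize_logs", [{"tool": "read_file"}], worker=0, task_idx=7)
row = db.execute("SELECT last_task FROM checkpoint WHERE worker = 0").fetchone()
print(row[0])  # 7
```

On restart, each worker reads its `last_task` and continues from there, which is what lets large batch runs proceed without babysitting.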
For teams and researchers, Hermes's RL infrastructure provides a complete loop: deploy the agent on real tasks, collect trajectories, filter and rate the interesting ones, run the RL pipeline, update the model, redeploy. This loop works with both cloud models (using the API) and local models via Ollama or vLLM — the RL pipeline is model-agnostic by design, matching the rest of the Hermes architecture.