Voice mode transforms Hermes into a spoken assistant. Speak your commands instead of typing, hear responses read aloud, and have voice messages automatically transcribed on messaging platforms.
## Components

- **Speech-to-Text (STT):** your voice → text commands
- **Text-to-Speech (TTS):** Hermes responses → audio
## STT Providers
| Provider | Speed | Cost | Privacy |
|---|---|---|---|
| Local (faster-whisper) | Good | Free | Full |
| Groq | Excellent | Free tier | Cloud |
| OpenAI Whisper | Good | Paid | Cloud |
| Mistral Voxtral | Good | Paid | Cloud |
## TTS Providers
| Provider | Quality | Cost |
|---|---|---|
| Edge TTS | Good | Free |
| ElevenLabs | Excellent | Paid |
| OpenAI TTS | Excellent | Paid |
| MiniMax | Good | Paid |
| Mistral | Good | Paid |
## Quick Setup: Free Local Voice
1. Install voice dependencies:

   ```bash
   pip install 'hermes-agent[voice]'
   ```

2. Enable in config:

   ```yaml
   stt:
     enabled: true
     provider: local
     local:
       model: base  # tiny | base | small | medium | large-v3
   ```

3. Use in the CLI:

   ```text
   /voice on
   # Press Ctrl+B to record
   # Speak your command
   # Release to send
   ```
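The `local` provider is built on the faster-whisper library. A minimal sketch of the transcription step, calling the library directly (the file name and wiring are illustrative, not Hermes internals):

```python
from pathlib import Path

def transcribe_local(audio_path: str, model_size: str = "base") -> str:
    # faster-whisper ships with the voice extra; int8 keeps it CPU-friendly.
    from faster_whisper import WhisperModel
    model = WhisperModel(model_size, compute_type="int8")
    segments, info = model.transcribe(audio_path)
    return " ".join(seg.text.strip() for seg in segments)

# Only runs if you have a recording on disk (downloads the model on first use).
if Path("command.wav").exists():
    print(transcribe_local("command.wav"))
```

The first call downloads the chosen model, so expect a delay before the first transcription.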
## Voice on Messaging Platforms
Voice messages are auto-transcribed:
[Telegram voice note] → "Hey Hermes, check the server status"
Hermes processes the text normally, responds with text.
To enable TTS responses on messaging:
```yaml
tts:
  enabled: true
  provider: edge  # Free Microsoft voices
```
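Under the hood, the free `edge` provider can be sketched with the `edge-tts` Python package (that Hermes bundles this package, and the specific voice name, are assumptions on my part; Hermes's actual wiring may differ):

```python
import asyncio

async def speak(text: str, out_path: str = "reply.mp3") -> None:
    # edge-tts streams audio from Microsoft's free neural voices;
    # "en-US-AriaNeural" is one of many available voice names.
    import edge_tts
    communicate = edge_tts.Communicate(text, voice="en-US-AriaNeural")
    await communicate.save(out_path)

# Requires network access to Microsoft's TTS endpoint:
# asyncio.run(speak("Server status: all green."))
```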
## Model Size vs Quality
| Model | Parameters | VRAM | Quality |
|---|---|---|---|
| tiny | 39M | ~1GB | Fast, less accurate |
| base | 74M | ~1GB | Balanced |
| small | 244M | ~2GB | Better accuracy |
| medium | 769M | ~5GB | Good quality |
| large-v3 | 1550M | ~10GB | Best quality |
| turbo | 809M | ~6GB | Fast + good |
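The table reduces to a simple rule of thumb: pick the largest model that fits your GPU. A hypothetical helper (`pick_whisper_model` is a sketch, not part of Hermes; thresholds come from the table above):

```python
def pick_whisper_model(vram_gb: float) -> str:
    # Largest-first: return the biggest model whose VRAM need fits.
    for model, needed_gb in [("large-v3", 10), ("turbo", 6), ("medium", 5),
                             ("small", 2), ("base", 1)]:
        if vram_gb >= needed_gb:
            return model
    return "tiny"  # fallback for very constrained hardware
```

For example, a 4 GB card lands on `small`; anything with 10 GB or more can run `large-v3`.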
## Language Detection
Auto-detect language:
```yaml
stt:
  local:
    model: base
    # language: ""  # Leave empty for auto-detect
```
Force specific language:
```yaml
stt:
  local:
    model: base
    language: "en"  # or "es", "fr", "de", etc.
```
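In faster-whisper terms, an empty `language` setting maps to `language=None`, which triggers automatic detection. A tiny sketch of that mapping (`stt_kwargs` is a hypothetical helper, not a Hermes function):

```python
def stt_kwargs(language: str = "") -> dict:
    # faster-whisper auto-detects when language is None;
    # the empty string in the config corresponds to that.
    return {"language": language or None}

# model.transcribe("audio.wav", **stt_kwargs("en"))  # forced English
# model.transcribe("audio.wav", **stt_kwargs(""))    # auto-detect
```

Forcing the language skips the detection pass, which also speeds up short clips slightly.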
## Cloud STT with Groq (Free)
```bash
# ~/.hermes/.env
GROQ_API_KEY=gsk_xxx
```

```yaml
stt:
  enabled: true
  provider: groq  # Uses Whisper via Groq's free tier
```
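Groq exposes an OpenAI-compatible endpoint, so the `groq` provider's STT call can be sketched with the `openai` client pointed at Groq (the model id and the wiring are assumptions about what Hermes does internally, not documented internals):

```python
import os

# Groq's OpenAI-compatible API base (assumption: openai>=1.0 installed).
GROQ_BASE_URL = "https://api.groq.com/openai/v1"

def transcribe_with_groq(audio_path: str) -> str:
    from openai import OpenAI
    client = OpenAI(api_key=os.environ["GROQ_API_KEY"],
                    base_url=GROQ_BASE_URL)
    with open(audio_path, "rb") as f:
        # whisper-large-v3 is the Whisper model Groq hosts.
        resp = client.audio.transcriptions.create(
            model="whisper-large-v3", file=f)
    return resp.text

# transcribe_with_groq("command.wav")  # requires GROQ_API_KEY and network
```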
## Premium TTS with ElevenLabs
```bash
ELEVENLABS_API_KEY=el_xxx
```

```yaml
tts:
  enabled: true
  provider: elevenlabs
```
## Troubleshooting
### "No audio input detected"

- Check microphone permissions.
- Verify that the correct default input device is selected.
- In the CLI, confirm that Ctrl+B triggers recording.
### "STT transcription is inaccurate"

- Use a larger model: `model: small` or `model: large-v3`.
- Force the language if it is known: `language: "en"`.
### "TTS sounds robotic"

- Switch from Edge TTS to ElevenLabs or OpenAI TTS; both offer noticeably more natural voices.