Set Up Hermes Voice Mode — Speak and Listen
Enable voice interaction with Hermes — speak your commands, hear the responses with text-to-speech.
Voice mode turns Hermes into a spoken assistant: speak your commands instead of typing, and hear responses read aloud. In the CLI it works with push-to-talk, and on Telegram, Discord, and other platforms it transcribes voice messages automatically.
Before you start:
- ☑ Hermes Agent installed
- ☑ Microphone and speakers/headphones
- ☑ For local STT: enough RAM for the Whisper model (about 1 GB for 'base')
Steps
1. Install voice dependencies
   Run pip install 'hermes-agent[voice]' to add faster-whisper and the audio libraries.
2. Configure STT provider
   In config.yaml, set enabled: true under stt and set provider to local, groq, or openai.
3. Configure TTS provider
   Choose a TTS provider: Edge TTS (free), ElevenLabs, OpenAI, MiniMax, or Mistral.
4. Add API keys if needed
   For cloud providers, add GROQ_API_KEY or VOICE_TOOLS_OPENAI_KEY to .env.
5. Enable voice in CLI
   Type /voice on in the CLI, then press Ctrl+B to record.
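The STT settings and API keys from steps 2 and 4 can be sanity-checked with a short script. The sketch below is a hypothetical helper, not part of Hermes. It assumes the config has already been parsed into a Python dict with the nested layout stt.enabled and stt.provider, that groq needs GROQ_API_KEY and openai needs VOICE_TOOLS_OPENAI_KEY, and that the keys from .env have been loaded into the process environment.

```python
import os

# Provider names come from step 2. The provider-to-key mapping is an
# assumption based on the key names in step 4; local STT needs no key.
REQUIRED_KEY = {
    "local": None,
    "groq": "GROQ_API_KEY",
    "openai": "VOICE_TOOLS_OPENAI_KEY",
}


def check_stt_config(config, env=None):
    """Return a list of problems; an empty list means the STT settings look usable."""
    env = os.environ if env is None else env
    stt = config.get("stt") or {}
    problems = []

    if stt.get("enabled") is not True:
        problems.append("stt.enabled is not true")

    provider = stt.get("provider")
    if provider not in REQUIRED_KEY:
        problems.append(
            f"stt.provider must be one of {sorted(REQUIRED_KEY)}, got {provider!r}"
        )
        return problems

    key = REQUIRED_KEY[provider]
    if key and not env.get(key):
        problems.append(f"{key} is not set (needed for provider {provider!r})")
    return problems


if __name__ == "__main__":
    example = {"stt": {"enabled": True, "provider": "groq"}}
    for problem in check_stt_config(example):
        print("problem:", problem)
```

Passing env explicitly, as in check_stt_config(cfg, env={}), makes the check easy to test without touching real credentials.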
Pro Tips
- 💡 Local STT is free and runs on-device: no API key needed, just CPU and RAM
- 💡 Groq offers free-tier Whisper with fast transcription, a good alternative to local STT
- 💡 Edge TTS is free and sounds natural, making it the best-value TTS option
- 💡 STT model sizes: tiny (fast, less accurate), base (balanced), large-v3 (most accurate, slow)
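To get a feel for the speed and accuracy trade-off between model sizes, faster-whisper (installed by the voice extra) can be tried directly on a recording. This is a minimal sketch of a standalone experiment, not a description of how Hermes calls the library internally. The function names are hypothetical, and CPU with int8 quantization is an assumed setup chosen to keep memory use low.

```python
def join_segments(segments):
    """Concatenate the text of transcription segments into one string."""
    return " ".join(seg.text.strip() for seg in segments).strip()


def transcribe(path, model_size="base"):
    """Transcribe an audio file on the CPU and return (language, text)."""
    # Imported lazily so join_segments works without the package installed.
    from faster_whisper import WhisperModel

    # The model weights are downloaded on first use.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus detection info.
    segments, info = model.transcribe(path)
    return info.language, join_segments(segments)


# Example (file name is a placeholder):
#   language, text = transcribe("memo.ogg", model_size="small")
```

Running the same file through tiny, base, and large-v3 and comparing the output is a quick way to decide which size is worth the extra time on a given machine.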
Troubleshooting
❌ No audio input detected
✅ Check your microphone permissions and default input device. In CLI, verify Ctrl+B triggers recording.
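One way to confirm that the operating system exposes a microphone at all is to list the input-capable audio devices from Python. The sketch below assumes the third-party sounddevice package is installed. The source does not say which audio library Hermes uses, so treat this as an independent diagnostic with hypothetical function names.

```python
def input_devices(devices):
    """Keep only the names of devices with at least one input channel."""
    return [d["name"] for d in devices if d.get("max_input_channels", 0) > 0]


def list_microphones():
    """Return the names of input-capable devices reported by sounddevice."""
    # Lazy import: needs the sounddevice package and the PortAudio library.
    import sounddevice as sd

    return input_devices(sd.query_devices())


# Example:
#   print(list_microphones())
```

An empty result points to a missing device or driver rather than a Hermes setting.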
❌ STT transcription is inaccurate
✅ Try a larger model: in config.yaml, set model under stt, local to small or large-v3. Larger models are more accurate but slower.
❌ TTS voice sounds robotic
✅ Switch to a higher-quality provider like ElevenLabs or OpenAI TTS. Add the API key to .env.
❌ Voice messages not transcribed on Telegram
✅ Ensure enabled: true is set under stt in config.yaml. Also check that the voice file is in a supported format (opus, ogg, mp3).
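A rough pre-check for the format issue is to compare the file extension against the formats listed above. This is a hypothetical helper that only looks at the file name, not the actual codec, and the set contains just the three formats named in this guide; the real list Hermes accepts may differ.

```python
from pathlib import Path

# Formats named in the troubleshooting note; an assumption, not a full list.
SUPPORTED_VOICE_FORMATS = {"opus", "ogg", "mp3"}


def is_supported_voice_file(filename):
    """True if the extension (case-insensitive) is one of the listed formats."""
    return Path(filename).suffix.lower().lstrip(".") in SUPPORTED_VOICE_FORMATS


# Example:
#   is_supported_voice_file("voice_note.ogg")
```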