Voice Mode — Speak and Listen with Hermes
Key Points
- STT (speech-to-text) for voice input
- TTS (text-to-speech) for spoken responses
- Works with voice messages on Telegram and Discord
- Multiple voice engines supported
- Hands-free operation
- Natural conversation flow
How It Works
1. Send a voice message on Telegram or Discord
2. Hermes transcribes with Whisper or cloud STT
3. Processes your request with full context
4. Responds with synthesized speech via ElevenLabs or local TTS
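The steps above can be sketched as three pluggable stages chained together. This is a minimal illustration, not Hermes's actual code: `VoicePipeline`, `handle_voice_message`, and the three `fake_*` stages are hypothetical names, and the stages are stand-ins so the example runs without any speech libraries.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class VoicePipeline:
    """Hypothetical container for the three stages (names are assumptions)."""
    transcribe: Callable[[bytes], str]   # step 2: audio -> text
    reason: Callable[[str], str]         # step 3: text -> reply text
    synthesize: Callable[[str], bytes]   # step 4: reply text -> audio

    def handle_voice_message(self, audio: bytes) -> Tuple[str, bytes]:
        text = self.transcribe(audio)
        reply = self.reason(text)
        # Return both forms so the caller can send text, audio, or both.
        return reply, self.synthesize(reply)


# Stand-in stages: they only move bytes and strings around.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_reason(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(reply: str) -> bytes:
    return reply.encode("utf-8")


pipeline = VoicePipeline(fake_transcribe, fake_reason, fake_synthesize)
reply_text, reply_audio = pipeline.handle_voice_message(
    b"what is on my schedule tomorrow"
)
```

Keeping each stage behind a plain callable is one way to swap Whisper for a cloud STT service, or ElevenLabs for a local TTS engine, without touching the rest of the flow.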
Real-World Use Cases
Mobile Voice Commands
Send Hermes voice memos from Telegram on your phone: 'Add a task to check on the server deployment' or 'What's on my schedule tomorrow?' Each request gets full context and full tool access, and the reply comes back as speech or text, whichever you prefer.
Dictated Content Creation
Speak your rough ideas and Hermes produces polished written output. The transcription and LLM pipeline removes filler words, rephrases, and formats the result automatically, which is often faster than typing long-form content.
Discord Voice Channel Assistant
Invite Hermes into a Discord voice channel for real-time voice conversations. The agent participates in the channel, responds to questions, runs skills, and provides information without anyone needing to type.
Accessibility-First Interaction
For users with limited keyboard access, voice mode provides full Hermes capability through speech alone. Every tool, skill, and automation is accessible by voice on Telegram, Discord, or the CLI.
Under the Hood
Hermes voice mode connects three pipeline stages: transcription, reasoning, and synthesis. Transcription uses OpenAI Whisper by default (running locally or via API), with cloud STT services available as alternatives. Because Whisper is multilingual, Hermes accepts voice input in 90+ languages without configuration, including accented speech and mixed-language input, although accuracy varies by language and audio quality.
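As a rough sketch of how a transcription backend might be selected, the example below keeps a small registry keyed by name. The registry, the backend names, and `get_transcriber` are assumptions made for illustration and are not Hermes configuration. The local backend uses the open-source `openai-whisper` package (`whisper.load_model` and `model.transcribe`), imported lazily so the rest of the sketch runs without it.

```python
from typing import Callable, Dict


def transcribe_with_local_whisper(audio_path: str) -> str:
    # Requires the `openai-whisper` package and ffmpeg on the machine.
    import whisper  # lazy import: other backends work without it installed

    # A real service would load the model once and reuse it across calls.
    model = whisper.load_model("base")
    return model.transcribe(audio_path)["text"].strip()


def transcribe_with_stub(audio_path: str) -> str:
    # Placeholder backend, useful for tests and dry runs.
    return f"[transcript of {audio_path}]"


# Hypothetical registry; a cloud STT client would be registered the same way.
BACKENDS: Dict[str, Callable[[str], str]] = {
    "whisper-local": transcribe_with_local_whisper,
    "stub": transcribe_with_stub,
}


def get_transcriber(name: str = "whisper-local") -> Callable[[str], str]:
    try:
        return BACKENDS[name]
    except KeyError:
        raise ValueError(f"unknown STT backend: {name!r}") from None
```

Calling `get_transcriber("stub")("memo.ogg")` exercises the selection logic without loading a model; switching the name to `"whisper-local"` would run real transcription on a machine with Whisper installed.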
The transcribed text flows into the standard Hermes reasoning pipeline with full access to memory, tools, and skills. This is the key distinction from simple voice assistants: rather than just answering trivia, Hermes executes multi-step workflows in response to voice commands. 'Check the deployment logs and tell me if anything failed overnight' triggers real tool calls (SSH, log parsing, analysis), not a web search.
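To make the multi-step idea concrete, here is a toy sketch in which a spoken request is carried through a chain of tools. Everything in it is hypothetical: the tool names, the canned log lines, and the hard-coded plan stand in for what the reasoning stage and real SSH or log tools would do.

```python
from typing import Callable, Dict, List


# Hypothetical tools. Real ones would connect to a host and read actual logs.
def fetch_logs(_request: str) -> str:
    return (
        "02:10 deploy started\n"
        "02:14 ERROR migration failed\n"
        "02:15 rollback complete"
    )


def find_failures(logs: str) -> str:
    return "\n".join(line for line in logs.splitlines() if "ERROR" in line)


def summarize(failures: str) -> str:
    count = len(failures.splitlines())
    return f"{count} failure(s) overnight." if count else "Nothing failed overnight."


TOOLS: Dict[str, Callable[[str], str]] = {
    "fetch_logs": fetch_logs,
    "find_failures": find_failures,
    "summarize": summarize,
}


def run_plan(request: str, plan: List[str]) -> str:
    # Each tool receives the previous tool's output.
    value = request
    for step in plan:
        value = TOOLS[step](value)
    return value


spoken_request = "Check the deployment logs and tell me if anything failed overnight"
# In a real agent the plan would come from the model, not a literal list.
answer = run_plan(spoken_request, ["fetch_logs", "find_failures", "summarize"])
```

The point of the sketch is only that the transcript is treated like any typed request: once it is text, the same tool chain applies.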
Speech synthesis uses ElevenLabs for natural-sounding output with configurable voice profiles, or local TTS engines for fully offline operation. Response length is adjusted automatically for voice delivery: because reading a 500-word analysis aloud is rarely useful, Hermes speaks a summary and makes the full text available as a follow-up. On Telegram, voice responses are delivered as audio files; in a Discord voice channel, they are streamed directly into the channel in real time.
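The length adjustment could look roughly like the sketch below. It is an assumption-laden illustration: the 60-word budget, the `VoiceReply` type, and `prepare_voice_reply` are invented here, and the "summary" simply keeps the leading sentences, whereas the description above implies the model writes a real summary.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

MAX_SPOKEN_WORDS = 60  # assumed budget, not a documented Hermes setting


@dataclass
class VoiceReply:
    spoken: str                    # text handed to the TTS engine
    follow_up_text: Optional[str]  # full text sent separately, if trimmed


def prepare_voice_reply(full_text: str, max_words: int = MAX_SPOKEN_WORDS) -> VoiceReply:
    if len(full_text.split()) <= max_words:
        return VoiceReply(spoken=full_text, follow_up_text=None)

    # Stand-in for a model-written summary: keep leading sentences that fit.
    sentences = re.split(r"(?<=[.!?])\s+", full_text.strip())
    kept: List[str] = []
    used = 0
    for sentence in sentences:
        words = len(sentence.split())
        # Always keep the first sentence, even if it exceeds the budget.
        if kept and used + words > max_words:
            break
        kept.append(sentence)
        used += words
    return VoiceReply(spoken=" ".join(kept), follow_up_text=full_text)
```

A short answer is spoken in full with no follow-up, while a long one yields a brief spoken portion plus the complete text, which matches the split between the audio reply and the text follow-up.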