Nous Research Hermes Agent

Voice Mode — Speak and Listen with Hermes

Key Points

  • STT (speech-to-text) for voice input
  • TTS (text-to-speech) for spoken responses
  • Works with Telegram voice messages and Discord voice channels
  • Multiple voice engines supported
  • Hands-free operation
  • Natural conversation flow

How It Works

  1. Send a voice message on Telegram or Discord
  2. Hermes transcribes it with Whisper or a cloud STT service
  3. Hermes processes your request with full context
  4. Hermes responds with synthesized speech via ElevenLabs or local TTS
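
The four steps above can be sketched as a small pipeline. This is only an illustration: the stage signatures, the `VoiceReply` type, and the `fake_*` stand-ins are hypothetical names chosen for the example, not Hermes's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; the real Hermes interfaces are not shown on this page.
Transcriber = Callable[[bytes], str]   # audio in, text out (STT)
Reasoner = Callable[[str], str]        # request text in, reply text out
Synthesizer = Callable[[str], bytes]   # reply text in, audio out (TTS)


@dataclass
class VoiceReply:
    transcript: str  # what the STT stage heard
    text: str        # the written reply
    audio: bytes     # the synthesized reply


def handle_voice_message(
    audio: bytes, stt: Transcriber, reason: Reasoner, tts: Synthesizer
) -> VoiceReply:
    """Run one voice message through transcription, reasoning, and synthesis."""
    transcript = stt(audio).strip()
    if not transcript:
        # Nothing intelligible was heard, so skip the reasoning stage.
        text = "Sorry, I could not make out that message."
    else:
        text = reason(transcript)
    return VoiceReply(transcript=transcript, text=text, audio=tts(text))


# Stand-in stages so the sketch runs without any speech libraries installed.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_reason(request: str) -> str:
    return f"You asked: {request}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Passing the stages in as plain callables keeps the flow independent of any one engine, so a local model and a cloud service could be swapped without touching `handle_voice_message`.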

Real-World Use Cases

Mobile Voice Commands

Send Hermes voice memos from Telegram on your phone, such as 'Add a task to check on the server deployment' or 'What's on my schedule tomorrow?'. Each request runs with full context and full tool access, and the reply arrives as speech or text, whichever you prefer.

Dictated Content Creation

Speak your rough ideas and Hermes produces polished written output. The transcription and LLM pipeline removes filler words, rephrases, and formats the text automatically, which is often faster than typing long-form content.

Discord Voice Channel Assistant

Invite Hermes into a Discord voice channel for real-time voice conversations. The agent participates in the channel, responds to questions, runs skills, and provides information without anyone needing to type.

Accessibility-First Interaction

For users with limited keyboard access, voice mode provides full Hermes capability through speech alone. Every tool, skill, and automation is accessible by voice on Telegram, Discord, or the CLI.

Under the Hood

Hermes voice mode connects three pipeline stages: transcription, reasoning, and synthesis. Transcription uses OpenAI Whisper by default (running locally or via API), with cloud STT services available as alternatives. Because Whisper is multilingual, Hermes accepts voice input in 90+ languages without extra configuration, though accuracy varies by language, accent, and how often the speaker switches languages.
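
A minimal sketch of the transcription stage, assuming the open-source `openai-whisper` package for the local path. The `transcribe` helper and its try-engines-in-order fallback are assumptions made for illustration; the page does not say how Hermes chooses between local Whisper and a cloud STT service.

```python
from typing import Callable, Sequence

Engine = Callable[[str], str]  # path to an audio file in, transcript out


def whisper_local(path: str, model_name: str = "base") -> str:
    """Transcribe with the open-source `openai-whisper` package (requires ffmpeg)."""
    import whisper  # imported lazily so this module loads without the package

    # Loading per call keeps the sketch short; a real service would cache the model.
    model = whisper.load_model(model_name)
    result = model.transcribe(path)  # language is auto-detected when not specified
    return result["text"].strip()


def transcribe(path: str, engines: Sequence[Engine]) -> str:
    """Try each engine in order and return the first non-empty transcript."""
    errors = []
    for engine in engines:
        try:
            text = engine(path)
        except Exception as exc:  # e.g. missing package or network failure
            errors.append(f"{getattr(engine, '__name__', 'engine')}: {exc}")
            continue
        if text.strip():
            return text.strip()
    raise RuntimeError("no STT engine produced a transcript: " + "; ".join(errors))
```

A call such as `transcribe("memo.ogg", [whisper_local, cloud_stt])` would prefer the local model and fall back to a cloud engine, where `cloud_stt` is a hypothetical wrapper around whichever service is configured.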

The transcribed text flows into the standard Hermes reasoning pipeline with full access to memory, tools, and skills. This is the key distinction from simple voice assistants: Hermes does more than answer trivia; it executes multi-step workflows in response to voice commands. 'Check the deployment logs and tell me if anything failed overnight' triggers real tool calls (SSH, log parsing, analysis) rather than a web search.
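
To make the deployment-log example concrete, here is a toy version of a multi-step tool workflow. Everything in it is hypothetical: the tool names, the canned log lines, and the hard-coded plan stand in for the tool selection that the reasoning stage would perform.

```python
from typing import Callable, Dict, List


# Hypothetical tools with canned data; a real agent would reach a host over SSH here.
def fetch_logs(_request: str) -> str:
    return (
        "02:14 deploy started\n"
        "02:15 ERROR migration failed\n"
        "02:20 deploy rolled back"
    )


def find_failures(logs: str) -> str:
    """Keep only the log lines that report an error."""
    return "\n".join(line for line in logs.splitlines() if "ERROR" in line)


def summarize(failures: str) -> str:
    """Turn the filtered lines into a sentence short enough to speak."""
    if not failures:
        return "Nothing failed overnight."
    count = len(failures.splitlines())
    return f"{count} failure(s) overnight: {failures}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "fetch_logs": fetch_logs,
    "find_failures": find_failures,
    "summarize": summarize,
}


def run_plan(request: str, plan: List[str]) -> str:
    """Pass the request through each named tool, feeding every output to the next."""
    value = request
    for name in plan:
        value = TOOLS[name](value)
    return value
```

For instance, `run_plan("Check the deployment logs", ["fetch_logs", "find_failures", "summarize"])` chains the three tools and returns a single sentence ready for speech synthesis.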

Speech synthesis uses ElevenLabs for natural-sounding output with configurable voice profiles, or local TTS engines for fully offline operation. Response length is automatically adjusted for voice delivery: reading a 500-word analysis aloud is rarely useful, so Hermes speaks a summary and makes the full text available as a follow-up. On Telegram, voice responses are delivered as audio files; on Discord VC, they are streamed directly into the voice channel in real time.
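
The length adjustment described above might look roughly like the following. The 60-word threshold and the keep-the-leading-sentences rule are assumptions standing in for whatever summarization Hermes actually performs, which this page does not specify.

```python
import re
from typing import Optional, Tuple

MAX_SPOKEN_WORDS = 60  # assumed threshold for illustration, not a Hermes setting


def split_for_voice(
    text: str, max_words: int = MAX_SPOKEN_WORDS
) -> Tuple[str, Optional[str]]:
    """Return (spoken_part, follow_up_text).

    follow_up_text is None when the whole reply is short enough to speak.
    """
    if len(text.split()) <= max_words:
        return text, None

    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    spoken = []
    used = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Always keep the first sentence, then stop once the budget is exceeded.
        if spoken and used + n > max_words:
            break
        spoken.append(sentence)
        used += n
    return " ".join(spoken), text
```

The spoken part would go to the TTS engine, while the second element, when present, would be sent as an ordinary text message.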

Related Features