
Voice Mode — Speak and Listen with Hermes

Quick answer

Hermes voice mode adds full speech interaction: speech-to-text for voice input and text-to-speech for spoken responses, including automatic transcription of Telegram voice messages. Multiple voice engines are supported, with cloud options for quality or local engines for a fully private, offline pipeline.

Key Points

✓ STT (speech-to-text) for voice input
✓ TTS (text-to-speech) for spoken responses
✓ Works on Telegram voice messages
✓ Multiple voice engines supported
✓ Hands-free operation
✓ Natural conversation flow

How It Works

1. Send a voice message on Telegram or Discord
2. Hermes transcribes with Whisper or cloud STT
3. Processes your request with full context
4. Responds with synthesized speech via ElevenLabs or local TTS
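
The four steps above can be read as a small pipeline with three swappable stages. The following is a minimal sketch of that shape, not Hermes code: the names handle_voice_message, VoiceReply, and the fake_* engines are hypothetical, and the stand-in engines just pass text through so the example runs without any speech libraries installed.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases for the three stages; these are not Hermes APIs.
Transcriber = Callable[[bytes], str]   # audio bytes -> transcript
Reasoner = Callable[[str], str]        # transcript -> reply text
Synthesizer = Callable[[str], bytes]   # reply text -> audio bytes


@dataclass
class VoiceReply:
    transcript: str
    text: str
    audio: bytes


def handle_voice_message(
    audio: bytes, stt: Transcriber, agent: Reasoner, tts: Synthesizer
) -> VoiceReply:
    """Run one voice message through transcription, reasoning, and synthesis."""
    transcript = stt(audio).strip()
    if not transcript:
        raise ValueError("no speech recognised in audio")
    text = agent(transcript)
    return VoiceReply(transcript=transcript, text=text, audio=tts(text))


# Stand-in engines (assumptions for illustration only).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply = handle_voice_message(
    b"what is on my schedule tomorrow", fake_stt, fake_agent, fake_tts
)
print(reply.text)
```

Because each stage is just a callable, a cloud engine and a local engine can be exchanged without touching the rest of the flow, which is one plausible way to support several voice engines side by side.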

Real-World Use Cases

Mobile Voice Commands

Send Hermes voice memos from your phone's Telegram. 'Add a task to check on the server deployment' or 'What's on my schedule tomorrow?' — full context, full tool access, spoken or text response as you prefer.

Dictated Content Creation

Speak your rough ideas and Hermes produces polished written output. The transcription + LLM pipeline handles filler words, rephrasing, and formatting automatically — faster than typing for long-form content.

Discord Voice Channel Assistant

Invite Hermes into a Discord voice channel for real-time voice conversations. The agent participates in the channel, responds to questions, runs skills, and provides information without anyone needing to type.

Accessibility-First Interaction

For users with limited keyboard access, voice mode provides full Hermes capability through speech alone. Every tool, skill, and automation is accessible by voice on Telegram, Discord, or the CLI.

Under the Hood

Hermes voice mode connects three pipeline stages: transcription, reasoning, and synthesis. Transcription uses OpenAI Whisper by default, run either locally or through the API, with other cloud STT engines available as alternatives. Because Whisper is multilingual, Hermes accepts voice input in 90+ languages without configuration: the spoken language is detected automatically, though accuracy varies by language, accent, and how often a speaker switches between languages.
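
For a sense of what local transcription looks like, here is a minimal sketch using the open-source openai-whisper Python package. It assumes that package and ffmpeg are installed; whether Hermes calls Whisper this way internally is not stated here, and the function name transcribe_file and the file name memo.ogg are hypothetical.

```python
from typing import Any, Optional


def transcribe_file(
    path: str, model: Optional[Any] = None, model_name: str = "base"
) -> dict:
    """Transcribe an audio file and report the language Whisper detected."""
    if model is None:
        # Imported lazily so this module loads even without the package.
        import whisper  # pip install openai-whisper (needs ffmpeg on PATH)

        model = whisper.load_model(model_name)
    # No language is passed, so Whisper detects the spoken language itself.
    result = model.transcribe(path)
    return {
        "text": result["text"].strip(),
        "language": result.get("language", "unknown"),
    }


# Example call (requires the package, ffmpeg, and a real audio file):
# print(transcribe_file("memo.ogg"))
```

The optional model argument lets a caller load the model once and reuse it across messages, since loading is usually the slow part; it also makes the function easy to exercise with a stub.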

The transcribed text flows into the standard Hermes reasoning pipeline with full access to memory, tools, and skills. This is the key distinction from simple voice assistants: Hermes isn't just answering trivia; it executes multi-step workflows in response to voice commands. 'Check the deployment logs and tell me if anything failed overnight' triggers real tool calls (SSH, log parsing, analysis), not a web search.

Speech synthesis uses ElevenLabs for natural-sounding output with configurable voice profiles, or local TTS engines for fully offline operation. Response length is adjusted automatically for voice delivery: because reading a 500-word analysis aloud is rarely useful, Hermes speaks a summary and makes the full text available as a follow-up. On Telegram, voice responses are delivered as audio files; in a Discord voice channel, they are streamed directly into the channel in real time.
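
The idea of speaking a short version while keeping the full text for follow-up can be sketched as below. This is an illustration under stated assumptions, not the Hermes implementation: a real system would more likely ask the model for a summary, whereas this sketch simply keeps leading sentences up to a word budget. The names shape_for_voice and SpokenReply and the 60-word default are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class SpokenReply(NamedTuple):
    spoken: str               # text handed to the TTS engine
    follow_up: Optional[str]  # full text to send separately, or None


def shape_for_voice(text: str, max_words: int = 60) -> SpokenReply:
    """Keep short replies as-is; shorten long ones for speech."""
    text = text.strip()
    if len(text.split()) <= max_words:
        return SpokenReply(text, None)

    # Split after sentence-ending punctuation and keep whole sentences
    # while they still fit in the word budget.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept, count = [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if count + n > max_words:
            break
        kept.append(sentence)
        count += n
    if not kept:
        # The first sentence alone is over budget: fall back to a word cut.
        kept = [" ".join(text.split()[:max_words])]

    spoken = " ".join(kept) + " I have sent the full text as a follow-up."
    return SpokenReply(spoken, text)
```

Returning both parts lets the caller decide how to deliver them, for example an audio file for the spoken part and an ordinary text message for the follow-up.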

Frequently asked questions

How does voice mode work in Hermes?

A built-in STT + TTS pipeline: you speak (push-to-talk in the CLI, or voice notes on channels like Telegram), Hermes transcribes it, and responses can be spoken back.

Can voice mode run privately?

Yes. Use local speech engines for a fully offline, private pipeline, or cloud engines when you want higher quality and lower latency.

Does voice work over messaging apps?

Yes — Hermes automatically transcribes voice messages on Telegram and other channels, so you can talk to the agent from your phone.

Related Features

Multi-platform · Local LLM support · Code execution
