Voice mode transforms Hermes into a spoken assistant. Speak your commands instead of typing, hear responses read aloud, and have voice messages automatically transcribed on messaging platforms.
## Components

- **Speech-to-Text (STT):** your voice → text commands
- **Text-to-Speech (TTS):** Hermes responses → audio
## STT Providers
| Provider | Speed | Cost | Privacy |
|---|---|---|---|
| Local (faster-whisper) | Good | Free | Full |
| Groq | Excellent | Free tier | Cloud |
| OpenAI Whisper | Good | Paid | Cloud |
| Mistral Voxtral | Good | Paid | Cloud |
## TTS Providers
| Provider | Quality | Cost |
|---|---|---|
| Edge TTS | Good | Free |
| ElevenLabs | Excellent | Paid |
| OpenAI TTS | Excellent | Paid |
| MiniMax | Good | Paid |
| Mistral | Good | Paid |
## Quick Setup: Free Local Voice
1. Install voice dependencies:

   ```bash
   pip install 'hermes-agent[voice]'
   ```

2. Enable in config:

   ```yaml
   stt:
     enabled: true
     provider: local
     local:
       model: base  # tiny | base | small | medium | large-v3
   ```

3. Use in the CLI:

   ```text
   /voice on
   # Press Ctrl+B to record
   # Speak your command
   # Release to send
   ```
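The `local` provider is built on the faster-whisper library. A minimal sketch of the transcription step, calling the library directly (the file name and wiring are illustrative, not Hermes internals):

```python
from pathlib import Path

def transcribe_local(audio_path: str, model_size: str = "base") -> str:
    # faster-whisper ships with the voice extra; int8 keeps it CPU-friendly.
    from faster_whisper import WhisperModel
    model = WhisperModel(model_size, compute_type="int8")
    segments, info = model.transcribe(audio_path)
    return " ".join(seg.text.strip() for seg in segments)

# Only runs if you have a recording on disk (downloads the model on first use).
if Path("command.wav").exists():
    print(transcribe_local("command.wav"))
```

The first call downloads the chosen model, so expect a delay before the first transcription.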
## Voice on Messaging Platforms
Voice messages are auto-transcribed:
[Telegram voice note] → "Hey Hermes, check the server status"
Hermes processes the text normally, responds with text.
To enable TTS responses on messaging:
```yaml
tts:
  enabled: true
  provider: edge  # Free Microsoft voices
```
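Under the hood, the free `edge` provider can be sketched with the `edge-tts` Python package (that Hermes bundles this package, and the specific voice name, are assumptions on my part; Hermes's actual wiring may differ):

```python
import asyncio

async def speak(text: str, out_path: str = "reply.mp3") -> None:
    # edge-tts streams audio from Microsoft's free neural voices;
    # "en-US-AriaNeural" is one of many available voice names.
    import edge_tts
    communicate = edge_tts.Communicate(text, voice="en-US-AriaNeural")
    await communicate.save(out_path)

# Requires network access to Microsoft's TTS endpoint:
# asyncio.run(speak("Server status: all green."))
```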
## Model Size vs Quality
| Model | Parameters | VRAM | Quality |
|---|---|---|---|
| tiny | 39M | ~1GB | Fast, less accurate |
| base | 74M | ~1GB | Balanced |
| small | 244M | ~2GB | Better accuracy |
| medium | 769M | ~5GB | Good quality |
| large-v3 | 1550M | ~10GB | Best quality |
| turbo | 809M | ~6GB | Fast + good |
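The table reduces to a simple rule of thumb: pick the largest model that fits your GPU. A hypothetical helper (`pick_whisper_model` is a sketch, not part of Hermes; thresholds come from the table above):

```python
def pick_whisper_model(vram_gb: float) -> str:
    # Largest-first: return the biggest model whose VRAM need fits.
    for model, needed_gb in [("large-v3", 10), ("turbo", 6), ("medium", 5),
                             ("small", 2), ("base", 1)]:
        if vram_gb >= needed_gb:
            return model
    return "tiny"  # fallback for very constrained hardware
```

For example, a 4 GB card lands on `small`; anything with 10 GB or more can run `large-v3`.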
## Language Detection
Auto-detect language:
```yaml
stt:
  local:
    model: base
    # language: ""  # Leave empty for auto-detect
```
Force specific language:
```yaml
stt:
  local:
    model: base
    language: "en"  # or "es", "fr", "de", etc.
```
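In faster-whisper terms, an empty `language` setting maps to `language=None`, which triggers automatic detection. A tiny sketch of that mapping (`stt_kwargs` is a hypothetical helper, not a Hermes function):

```python
def stt_kwargs(language: str = "") -> dict:
    # faster-whisper auto-detects when language is None;
    # the empty string in the config corresponds to that.
    return {"language": language or None}

# model.transcribe("audio.wav", **stt_kwargs("en"))  # forced English
# model.transcribe("audio.wav", **stt_kwargs(""))    # auto-detect
```

Forcing the language skips the detection pass, which also speeds up short clips slightly.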
## Cloud STT with Groq (Free)
```bash
# ~/.hermes/.env
GROQ_API_KEY=gsk_xxx
```

```yaml
stt:
  enabled: true
  provider: groq  # Uses Whisper via Groq's free tier
```
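Groq exposes an OpenAI-compatible endpoint, so the `groq` provider's STT call can be sketched with the `openai` client pointed at Groq (the model id and the wiring are assumptions about what Hermes does internally, not documented internals):

```python
import os

# Groq's OpenAI-compatible API base (assumption: openai>=1.0 installed).
GROQ_BASE_URL = "https://api.groq.com/openai/v1"

def transcribe_with_groq(audio_path: str) -> str:
    from openai import OpenAI
    client = OpenAI(api_key=os.environ["GROQ_API_KEY"],
                    base_url=GROQ_BASE_URL)
    with open(audio_path, "rb") as f:
        # whisper-large-v3 is the Whisper model Groq hosts.
        resp = client.audio.transcriptions.create(
            model="whisper-large-v3", file=f)
    return resp.text

# transcribe_with_groq("command.wav")  # requires GROQ_API_KEY and network
```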
## Premium TTS with ElevenLabs
```bash
ELEVENLABS_API_KEY=el_xxx
```

```yaml
tts:
  enabled: true
  provider: elevenlabs
```
## Troubleshooting
### "No audio input detected"

- Check microphone permissions.
- Verify that the correct default input device is selected.
- In the CLI, confirm that Ctrl+B triggers recording.
### "STT transcription is inaccurate"

- Use a larger model: `model: small` or `model: large-v3`.
- Force the language if it is known: `language: "en"`.
### "TTS sounds robotic"

- Switch from Edge TTS to ElevenLabs or OpenAI TTS; both offer noticeably more natural voices.