Voice Mode: Hands-Free AI Assistance

Enable voice interaction with Hermes Agent: speech-to-text transcription, text-to-speech responses, and voice message support across platforms.

Voice mode transforms Hermes into a spoken assistant. Speak your commands instead of typing, hear responses read aloud, and have voice messages automatically transcribed on messaging platforms.

Components

  • Speech-to-Text (STT): Your voice → Text commands
  • Text-to-Speech (TTS): Hermes responses → Audio

STT Providers

Provider                Speed      Cost       Privacy
Local (faster-whisper)  Good       Free       Full
Groq                    Excellent  Free tier  Cloud
OpenAI Whisper          Good       Paid       Cloud
Mistral Voxtral         Good       Paid       Cloud

TTS Providers

Provider    Quality    Cost
Edge TTS    Good       Free
ElevenLabs  Excellent  Paid
OpenAI TTS  Excellent  Paid
MiniMax     Good       Paid
Mistral     Good       Paid

Quick Setup: Free Local Voice

1. Install voice dependencies:

pip install 'hermes-agent[voice]'

2. Enable in config:

stt:
  enabled: true
  provider: local
  local:
    model: base  # tiny | base | small | medium | large-v3

3. Use in CLI:

/voice on
# Press Ctrl+B to record
# Speak your command
# Release to send
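Before turning voice mode on, it can help to confirm that the local STT backend is importable. The check below is a minimal sketch, assuming the voice extra provides the faster-whisper package (import name `faster_whisper`); the helper name `is_installed` is hypothetical and not part of Hermes.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: local STT relies on the `faster_whisper` module.
    if is_installed("faster_whisper"):
        print("faster-whisper found: local STT should be available")
    else:
        print("faster-whisper missing: run pip install 'hermes-agent[voice]'")
```

If the module is missing, re-running the install command from step 1 in the same Python environment that runs Hermes is the first thing to try.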

Voice on Messaging Platforms

Voice messages are auto-transcribed:

[Telegram voice note] → "Hey Hermes, check the server status"

Hermes processes the transcribed text like any typed message and responds with text.
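The flow described here can be pictured as a two-stage pipeline: transcribe the audio, then hand the text to the normal message handler. This is an illustrative sketch only, not Hermes's actual implementation; `handle_voice_note` and both callables are hypothetical names introduced for the example.

```python
from typing import Callable


def handle_voice_note(
    audio_path: str,
    transcribe: Callable[[str], str],
    respond: Callable[[str], str],
) -> str:
    """Transcribe a voice note, then treat the result as a typed message."""
    text = transcribe(audio_path).strip()
    if not text:
        # Nothing intelligible was transcribed, so there is no command to run.
        return "Sorry, I could not make out any speech in that message."
    return respond(text)


# Stand-ins for a real STT provider and the agent's text handler.
def fake_transcribe(path: str) -> str:
    return "Hey Hermes, check the server status"


def fake_respond(text: str) -> str:
    return f"Received: {text}"


print(handle_voice_note("note.ogg", fake_transcribe, fake_respond))
```

Keeping transcription separate from response handling means the STT provider can be swapped without touching how messages are processed.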

To enable TTS responses on messaging platforms:

tts:
  enabled: true
  provider: edge  # Free Microsoft voices

Model Size vs Quality

Model     Parameters  VRAM   Quality
tiny      39M         ~1GB   Fast, less accurate
base      74M         ~1GB   Balanced
small     244M        ~2GB   Better accuracy
medium    769M        ~5GB   Good quality
large-v3  1550M       ~10GB  Best quality
turbo     809M        ~6GB   Fast + good
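To turn the table into a quick decision, the sketch below picks a model for a given VRAM budget. It uses the approximate figures from the table and assumes that parameter count is a reasonable proxy for accuracy, which is a simplification; `pick_model` is a hypothetical helper, not a Hermes function.

```python
from typing import Optional

# (model name, parameters in millions, approximate VRAM in GB), from the table.
MODELS = [
    ("tiny", 39, 1.0),
    ("base", 74, 1.0),
    ("small", 244, 2.0),
    ("medium", 769, 5.0),
    ("turbo", 809, 6.0),
    ("large-v3", 1550, 10.0),
]


def pick_model(vram_gb: float) -> Optional[str]:
    """Return the largest model whose approximate VRAM need fits the budget."""
    fitting = [m for m in MODELS if m[2] <= vram_gb]
    if not fitting:
        return None  # even the smallest model does not fit
    # Assumption: more parameters means better accuracy.
    return max(fitting, key=lambda m: m[1])[0]
```

For example, `pick_model(2)` selects `small`, while a budget below 1 GB returns `None`. Leave some headroom in practice, since the VRAM figures are approximate.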

Language Detection

Auto-detect language:

stt:
  local:
    model: base
    # language: ""  # Leave empty for auto-detect

Force specific language:

stt:
  local:
    model: base
    language: "en"  # or "es", "fr", "de", etc.
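The two configurations above map naturally onto faster-whisper's `language` argument, where `None` requests auto-detection. The sketch below shows one way the `stt.local` settings could be interpreted. It is an assumption about the mapping, not Hermes's actual loader, and `language_arg` and `transcribe` are hypothetical names.

```python
from typing import Optional


def language_arg(stt_local: dict) -> Optional[str]:
    """Map the config value to a language code, or None for auto-detect."""
    lang = (stt_local.get("language") or "").strip()
    return lang or None


def transcribe(audio_path: str, stt_local: dict) -> str:
    """Transcribe a file with faster-whisper using the given local settings."""
    # Imported lazily so the helper above works without the voice extra.
    from faster_whisper import WhisperModel

    model = WhisperModel(stt_local.get("model", "base"))
    segments, info = model.transcribe(audio_path, language=language_arg(stt_local))
    print(f"language: {info.language}")  # detected or forced language code
    return "".join(segment.text for segment in segments).strip()
```

Forcing the language skips the detection step, which tends to help with short clips where auto-detection has little audio to work with.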

Cloud STT with Groq (Free Tier)

Add your API key to ~/.hermes/.env:

GROQ_API_KEY=gsk_xxx

Then select the provider in config:

stt:
  enabled: true
  provider: groq  # Uses Whisper via Groq's free tier

Premium TTS with ElevenLabs

Add your API key to ~/.hermes/.env:

ELEVENLABS_API_KEY=el_xxx

Then select the provider in config:

tts:
  enabled: true
  provider: elevenlabs
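A common failure with cloud providers is selecting the provider in config without setting its API key. The sketch below checks for the two key names used in this guide. It assumes the values from ~/.hermes/.env are already loaded into the process environment, and `missing_key` is a hypothetical helper rather than part of Hermes.

```python
import os
from typing import Mapping, Optional

# Environment variable names taken from the examples in this guide.
REQUIRED_KEY = {
    "groq": "GROQ_API_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
}


def missing_key(provider: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the unset API key for a provider, or None if fine."""
    name = REQUIRED_KEY.get(provider)
    if name is None:
        return None  # no key listed for this provider (e.g. local, edge)
    return None if env.get(name, "").strip() else name


if __name__ == "__main__":
    for provider in ("groq", "elevenlabs"):
        key = missing_key(provider)
        print(f"{provider}: {'ok' if key is None else 'set ' + key}")
```

Passing the environment as a parameter keeps the check easy to test against a plain dictionary.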

Troubleshooting

"No audio input detected"

  • Check microphone permissions
  • Verify default input device
  • In the CLI, confirm that Ctrl+B starts recording

"STT transcription inaccurate"

  • Use a larger model: set model: small or model: large-v3
  • Force language if known: language: "en"

"TTS sounds robotic"

  • Switch from Edge TTS to ElevenLabs or OpenAI TTS, which have more natural-sounding voices

Frequently Asked Questions

Does voice mode work offline?

Local STT (faster-whisper) works offline once the model has been downloaded. Cloud providers need an internet connection. Edge TTS is free but also needs an internet connection.

Can I use voice in Discord/Slack?

Voice message transcription works on all platforms. TTS output as audio messages depends on platform support.

How much VRAM does local STT need?

The 'base' model needs ~1GB. 'large-v3' needs ~10GB. Most laptops can run 'base' or 'small' comfortably.
