Introduction
Talking to an AI agent with your voice, hearing its replies, having a natural conversation — this is no longer science fiction. Hermes Agent's voice mode makes bidirectional oral conversation a reality. Whether you are at the CLI, on Telegram, or in a Discord voice channel, Hermes turns text-based interaction into a genuine spoken dialogue.
This article covers installation, configuration, and practical usage of Hermes Agent voice mode: speech recognition (STT), text-to-speech (TTS), and real-world use cases. If you haven't installed Hermes Agent yet, start with the complete installation guide.
Installing Voice Mode
System Dependencies
Before enabling voice, make sure you have the audio dependencies installed:
- PortAudio — for microphone capture and audio playback in the CLI
- ffmpeg — for audio format conversion (MP3 to Opus, PCM to WAV)
- opus — audio codec required for Discord
- espeak-ng — phonemizer for optional local TTS (NeuTTS)
# macOS
brew install portaudio ffmpeg opus espeak-ng
# Ubuntu / Debian
sudo apt install portaudio19-dev ffmpeg libopus0 espeak-ng
# Fedora
sudo dnf install portaudio-devel ffmpeg opus espeak-ng
Python Installation
Hermes Agent provides a dedicated voice extra that installs sounddevice and numpy:
pip install "hermes-agent[voice]"
If you use messaging (Telegram, Discord), the messaging extra is sufficient — it includes discord.py[voice] and python-telegram-bot:
pip install "hermes-agent[messaging]"
To install everything at once:
pip install "hermes-agent[all]"
Option: Local Whisper
For fully local speech recognition with zero API keys, install faster-whisper:
pip install faster-whisper
The model (approximately 150 MB for base) downloads automatically on first use. After that, speech recognition runs entirely on your machine.
Activating Voice Mode in the CLI
Launch and Commands
Start the CLI and enable voice mode:
hermes # Start the interactive CLI
Inside the CLI, use these slash commands:
- /voice: toggle voice mode on/off
- /voice on: enable voice mode (voice reply only when you send a voice message)
- /voice tts: enable speech synthesis for all messages (text + voice)
- /voice off: disable voice replies
- /voice status: show current state
Recording with Ctrl+B
Here is how a voice conversation unfolds in the CLI:
- Enable voice mode with /voice on
- Press Ctrl+B: a beep plays (880 Hz) and recording starts
- Speak: a live audio level bar shows your input, for example ● [▁▂▃▅▇▇▅▂] ❯
- Stop speaking: after 3 seconds of silence, recording auto-stops
- Two beeps (660 Hz) confirm the end of recording
- Audio is transcribed via Whisper and sent to the agent
- If TTS is enabled, the agent's reply is spoken aloud
- Recording automatically restarts — speak again without pressing any key
This loop continues until you press Ctrl+B during recording (exits continuous mode) or 3 consecutive recordings detect no speech.
Tip: the record key is configurable via voice.record_key in ~/.hermes/config.yaml (default: ctrl+b).
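The exit conditions of the continuous recording loop can be sketched in a few lines of Python. This is only an illustration, not Hermes's actual code: the function name and the way recordings are encoded (a transcript, an empty string for "no speech", None for "record key pressed") are assumptions made for the example.

```python
from typing import Iterable, List, Optional

MAX_EMPTY_RECORDINGS = 3  # from the article: stop after 3 recordings with no speech


def run_continuous_loop(recordings: Iterable[Optional[str]]) -> List[str]:
    """Collect transcripts until the user exits or silence repeats.

    Each item is a transcript, "" when no speech was detected, or None when
    the user pressed the record key during recording (hypothetical encoding).
    """
    transcripts: List[str] = []
    empty_streak = 0
    for transcript in recordings:
        if transcript is None:        # record key pressed: leave continuous mode
            break
        if not transcript.strip():    # nothing was said in this recording
            empty_streak += 1
            if empty_streak >= MAX_EMPTY_RECORDINGS:
                break
            continue
        empty_streak = 0              # any speech resets the streak
        transcripts.append(transcript)
    return transcripts
```

Only consecutive empty recordings count, so a single quiet pause between two spoken requests does not end the session.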
Silence Detection
The detection algorithm works in two stages:
- Speech confirmation — waits for audio above the RMS threshold (200) for at least 0.3 seconds, tolerating brief dips between syllables
- End detection — once speech is confirmed, triggers after 3 seconds of continuous silence
If no speech is detected for 15 seconds, recording stops automatically. The threshold and the silence duration are configurable via voice.silence_threshold and voice.silence_duration in ~/.hermes/config.yaml. Beeps can be disabled with voice.beep_enabled: false.
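To make the two stages concrete, here is a minimal Python sketch of such a detector. It is not taken from the Hermes source: the 100 ms frame size, the 200 ms dip tolerance, and all function names are assumptions. Only the threshold (200), the 0.3 s confirmation, the 3 s silence window, and the 15 s timeout come from the description above.

```python
import math
from typing import Iterable, Sequence


def rms(samples: Sequence[int]) -> float:
    """Root-mean-square level of one frame of 16-bit PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_end_of_speech(
    frame_levels: Iterable[float],
    frame_ms: int = 100,            # assumed frame size
    silence_threshold: float = 200.0,
    min_speech_ms: int = 300,
    silence_ms: int = 3000,
    no_speech_timeout_ms: int = 15000,
    dip_tolerance_ms: int = 200,    # assumed tolerance for dips between syllables
) -> str:
    """Return 'end_of_speech', 'no_speech', or 'still_recording'."""
    speech = dip = silence = elapsed = 0
    confirmed = False
    for level in frame_levels:
        elapsed += frame_ms
        loud = level >= silence_threshold
        if not confirmed:
            # Stage 1: wait for enough audio above the threshold.
            if loud:
                speech += frame_ms
                dip = 0
            else:
                dip += frame_ms
                if dip > dip_tolerance_ms:
                    speech = 0      # the pause was too long, start over
            if speech >= min_speech_ms:
                confirmed = True
            elif elapsed >= no_speech_timeout_ms:
                return "no_speech"
        else:
            # Stage 2: stop after a continuous run of silence.
            silence = 0 if loud else silence + frame_ms
            if silence >= silence_ms:
                return "end_of_speech"
    return "still_recording"
```

Working in integer milliseconds avoids floating-point drift when many short frames are summed. Raising silence_threshold makes the detector less sensitive to background noise.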
Streaming TTS
When TTS is enabled, the agent speaks its reply sentence by sentence as it generates — you don't wait for the full response. The system:
- Buffers text deltas into complete sentences (minimum 20 characters)
- Strips markdown formatting and code blocks
- Generates and plays audio per sentence in real time
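The buffering step above can be sketched as a small Python generator. This is an illustrative approximation, not the Hermes implementation: the function names, the boundary regex, and the choice to merge sentences shorter than 20 characters into the next one are all assumptions.

```python
import re
from typing import Iterable, Iterator

MIN_SENTENCE_CHARS = 20            # minimum length stated in the article
FENCE = "`" * 3                    # markdown code fence marker
_BOUNDARY = re.compile(r"[.!?]\s")
_FENCED = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)


def clean_for_speech(text: str) -> str:
    """Remove markdown that should not be read aloud."""
    text = _FENCED.sub(" ", text)                               # fenced code blocks
    text = re.sub(r"`([^`]*)`", r"\1", text)                    # inline code
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)        # links keep their label
    text = re.sub(r"^\s{0,3}#{1,6}\s+", "", text, flags=re.M)   # heading markers
    text = re.sub(r"\*{1,3}", "", text)                         # bold and italic markers
    return re.sub(r"\s+", " ", text).strip()


def sentences_from_deltas(deltas: Iterable[str]) -> Iterator[str]:
    """Yield speakable sentences as soon as they are complete."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        if buffer.count(FENCE) % 2:
            continue                     # inside an unclosed code fence: wait
        buffer = _FENCED.sub(" ", buffer)
        while True:
            cut = next(
                (m.end() for m in _BOUNDARY.finditer(buffer)
                 if m.end() >= MIN_SENTENCE_CHARS),
                None,
            )
            if cut is None:
                break
            sentence, buffer = clean_for_speech(buffer[:cut]), buffer[cut:]
            if sentence:
                yield sentence
    if buffer.count(FENCE) % 2:
        buffer = buffer[: buffer.rfind(FENCE)]   # drop an unterminated code block
    tail = clean_for_speech(buffer)
    if tail:
        yield tail
```

Each yielded sentence would then be passed to the TTS provider while the model keeps generating, which is what keeps perceived latency low.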
Whisper Hallucination Filter
Whisper sometimes generates phantom text from silence or background noise ("Thank you for watching", "Subscribe", etc.). Hermes automatically filters these artifacts using a set of 26 known hallucination phrases across multiple languages, plus a regex pattern that catches repetitive variations.
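A filter of this kind might look like the following Python sketch. The phrase set here is a small hypothetical sample (the actual 26 phrases are not listed in this article), and the repetition regex is one possible pattern, not necessarily the one Hermes uses.

```python
import re

# Hypothetical sample; the real list is said to contain 26 phrases.
KNOWN_HALLUCINATIONS = {
    "thank you for watching",
    "thanks for watching",
    "subscribe",
    "please subscribe",
}

# The same phrase repeated three or more times and nothing else.
_REPEATED = re.compile(r"^(.+?)(?: \1){2,}$")


def _normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text)
    return " ".join(no_punct.split()).lower()


def should_discard(transcript: str) -> bool:
    """True if the transcript looks like silence or a Whisper artifact."""
    norm = _normalize(transcript)
    if not norm:
        return True                       # nothing was transcribed
    if norm in KNOWN_HALLUCINATIONS:
        return True                       # exact match after normalization
    return bool(_REPEATED.match(norm))    # "thank you thank you thank you"
```

Matching on the normalized text means "Thank you for watching!" and "thank you for watching" are treated the same, while a genuine sentence that merely contains "thank you" passes through.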
Speech Recognition (STT)
Hermes Agent supports multiple STT providers with automatic fallback:
| Provider | Model | Speed | Quality | Cost | API Key |
|---|---|---|---|---|---|
| Local (faster-whisper) | base | Fast (CPU/GPU) | Good | Free | No |
| Local (faster-whisper) | small | Medium | Better | Free | No |
| Local (faster-whisper) | large-v3 | Slow | Best | Free | No |
| Groq | whisper-large-v3-turbo | Very fast (~0.5s) | Good | Free tier | Yes |
| Groq | whisper-large-v3 | Fast (~1s) | Better | Free tier | Yes |
| OpenAI | whisper-1 | Fast (~1s) | Good | Paid | Yes |
| OpenAI | gpt-4o-transcribe | Medium (~2s) | Best | Paid | Yes |
Automatic fallback priority: local → groq → openai.
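The fallback order can be pictured with a short Python sketch. It is an assumption-laden illustration rather than Hermes's real selection logic: the function names are invented, and the idea that the configured provider is tried first before falling through the chain is a guess. The environment variable names are the ones used in this article.

```python
import importlib.util
import os
from typing import Mapping, Optional

FALLBACK_ORDER = ("local", "groq", "openai")   # order given in the article


def provider_available(name: str, env: Mapping[str, str], local_installed: bool) -> bool:
    """Check whether a provider has what it needs to run."""
    if name == "local":
        return local_installed
    if name == "groq":
        return bool(env.get("GROQ_API_KEY"))
    if name == "openai":
        return bool(env.get("VOICE_TOOLS_OPENAI_KEY"))
    return False


def pick_stt_provider(
    preferred: str = "local",
    env: Optional[Mapping[str, str]] = None,
    local_installed: Optional[bool] = None,
) -> Optional[str]:
    """Return the first usable provider, or None if none is usable."""
    env = os.environ if env is None else env
    if local_installed is None:
        # Local STT is usable only if faster-whisper can be imported.
        local_installed = importlib.util.find_spec("faster_whisper") is not None
    candidates = [preferred] + [p for p in FALLBACK_ORDER if p != preferred]
    for name in candidates:
        if provider_available(name, env, local_installed):
            return name
    return None
```

With no API keys and faster-whisper installed, this resolves to "local", which matches the zero-key setup described earlier.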
STT Configuration
In ~/.hermes/config.yaml:
stt:
  provider: "local"
  local:
    model: "base" # tiny, base, small, medium, large-v3
API keys in ~/.hermes/.env:
GROQ_API_KEY=your-key
VOICE_TOOLS_OPENAI_KEY=your-key
Model overrides:
STT_GROQ_MODEL=whisper-large-v3-turbo
STT_OPENAI_MODEL=whisper-1
Text-to-Speech (TTS)
Hermes Agent supports ten TTS providers, from free to premium:
| Provider | Quality | Cost | Latency | API Key |
|---|---|---|---|---|
| Edge TTS (default) | Good | Free | ~1s | No |
| ElevenLabs | Excellent | Paid | ~2s | Yes |
| OpenAI TTS | Good | Paid | ~1.5s | Yes |
| MiniMax TTS | Excellent | Paid | ~1.5s | Yes |
| Mistral (Voxtral) | Excellent | Paid | ~2s | Yes |
| Google Gemini TTS | Excellent | Free tier | ~1s | Yes |
| xAI TTS | Excellent | Paid | ~1s | Yes |
| NeuTTS | Good | Free (local) | Variable | No |
| KittenTTS | Good | Free (local) | Variable | No |
| Piper | Good | Free (local) | Variable | No |
TTS Configuration
tts:
  provider: "edge"
  speed: 1.0
  edge:
    voice: "en-US-AriaNeural" # 322 voices, 74 languages
    speed: 1.0
  openai:
    model: "gpt-4o-mini-tts"
    voice: "alloy" # alloy, echo, fable, onyx, nova, shimmer
    speed: 1.0
  elevenlabs:
    voice_id: "pNInz6obpgDQGcFmaJgB"
    model_id: "eleven_multilingual_v2"
  minimax:
    model: "speech-2.8-hd"
    voice_id: "English_Graceful_Lady"
  xai:
    voice_id: "eve"
    language: "en"
Speed control: each provider can override the global multiplier. The hierarchy is: provider-specific speed → global tts.speed → 1.0 default.
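As an illustration of that hierarchy, the lookup could be written like this in Python. The function name is hypothetical, and the dictionary is assumed to be the parsed tts section of config.yaml.

```python
from typing import Any, Mapping


def resolve_tts_speed(tts_config: Mapping[str, Any]) -> float:
    """Provider-specific speed, else global tts.speed, else 1.0."""
    provider = tts_config.get("provider", "edge")
    provider_section = tts_config.get(provider) or {}
    for value in (provider_section.get("speed"), tts_config.get("speed")):
        if value is not None:
            return float(value)
    return 1.0
```

For example, with a global speed of 1.2 and edge.speed set to 0.9, Edge TTS would speak at 0.9 while a provider without its own speed key would use 1.2.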
Custom Command Providers
You can integrate any external TTS engine (VoxCPM, MLX-Kokoro, XTTS CLI) via a command-type provider in config.yaml. Hermes writes text to a temp file, runs your command, and reads the audio output — no Python needed.
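The temp-file flow described above can be sketched in Python as follows. This is not the Hermes implementation and does not show the real config.yaml schema: the {input} and {output} placeholders, the file names, and the function name are all assumptions made for the example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Sequence


def synthesize_with_command(text: str, command: Sequence[str], timeout: float = 60.0) -> bytes:
    """Write text to a temp file, run an external command, return the audio bytes."""
    with tempfile.TemporaryDirectory() as workdir:
        text_path = Path(workdir) / "input.txt"
        audio_path = Path(workdir) / "output.wav"
        text_path.write_text(text, encoding="utf-8")
        # Substitute the hypothetical placeholders with the real paths.
        argv = [
            part.replace("{input}", str(text_path)).replace("{output}", str(audio_path))
            for part in command
        ]
        subprocess.run(argv, check=True, timeout=timeout)
        return audio_path.read_bytes()


# Stand-in "engine" for demonstration only: it copies the text file to the
# output path instead of producing audio.
DEMO_COMMAND = [
    sys.executable, "-c",
    "import shutil, sys; shutil.copyfile(sys.argv[1], sys.argv[2])",
    "{input}", "{output}",
]
```

In practice the command would point at your TTS binary. Passing the command as a list rather than a shell string avoids quoting problems when the text or paths contain spaces.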
Telegram Voice Messages
If you have already connected Telegram to Hermes Agent, voice messages work immediately with no additional configuration.
Receiving Voice Messages
Send a voice message to your Telegram bot — Hermes transcribes it automatically via Whisper and injects the transcript as text into the conversation. The agent sees the transcript as a normal message.
Sending Voice Replies
Enable voice mode in Telegram:
- /voice on: voice reply only when you send a voice message
- /voice tts: voice reply for all messages
- /voice off: text only mode (default)
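The three modes boil down to a small decision rule, sketched here in Python. The enum and function names are illustrative and not part of the Hermes API.

```python
from enum import Enum


class VoiceMode(Enum):
    OFF = "off"   # text only (default)
    ON = "on"     # voice reply only to voice messages
    TTS = "tts"   # voice reply to every message


def parse_voice_command(command: str, current: VoiceMode) -> VoiceMode:
    """Map '/voice on', '/voice tts', '/voice off' to a mode; otherwise keep current."""
    parts = command.strip().lower().split()
    if len(parts) == 2 and parts[0] == "/voice":
        try:
            return VoiceMode(parts[1])
        except ValueError:
            return current
    return current


def should_send_voice_reply(mode: VoiceMode, incoming_was_voice: bool) -> bool:
    """Decide whether the reply should be spoken."""
    if mode is VoiceMode.TTS:
        return True
    if mode is VoiceMode.ON:
        return incoming_was_voice
    return False
```

So after /voice on, a typed question still gets a text answer, while a voice message gets a spoken one.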
Delivery format: voice replies are sent as native Opus/OGG voice bubbles that play inline in chat. If ffmpeg is not installed, MP3-producing providers send a regular audio file instead.
ffmpeg Note
Some providers produce Opus natively (OpenAI, ElevenLabs, Mistral) — no conversion needed. Others like Edge TTS, MiniMax, xAI, NeuTTS, KittenTTS and Piper require ffmpeg for conversion to the Opus/OGG format expected by Telegram.
Tip: if you don't want to install ffmpeg, switch to OpenAI or ElevenLabs which produce native Opus.
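For reference, the conversion step can be reproduced by hand with a few lines of Python. This is a sketch of the general technique, not Hermes's code: the helper names and the 64k bitrate are assumptions, and it requires an ffmpeg build that includes libopus.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def opus_command(source: Path, target: Path) -> List[str]:
    """Build an ffmpeg command that re-encodes audio as Opus in an OGG container."""
    return [
        "ffmpeg", "-y",          # overwrite the target if it exists
        "-i", str(source),
        "-c:a", "libopus",       # Opus audio codec
        "-b:a", "64k",           # assumed bitrate, adequate for speech
        str(target),
    ]


def to_voice_bubble(source: Path) -> Path:
    """Convert to OGG/Opus if ffmpeg is on PATH; otherwise return the source unchanged."""
    if shutil.which("ffmpeg") is None:
        return source            # fall back to sending the original file
    target = source.with_suffix(".ogg")
    subprocess.run(opus_command(source, target), check=True, capture_output=True)
    return target
```

Returning the source untouched when ffmpeg is missing mirrors the behavior described above, where the reply arrives as a regular audio file instead of a voice bubble.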
Platform Compatibility
Voice mode works differently across platforms:
CLI
Full voice interaction with Ctrl+B for recording and real-time audio playback. Works in both the classic CLI (hermes chat) and the TUI (hermes --tui). See the CLI mastery guide for more details.
Telegram
- Automatic send and receive of voice messages
- Native Opus/OGG voice bubbles
- Automatic STT transcription of received voice messages
- TTS replies sent as audio
Discord
- Voice messages in DMs and text channels (with @mention)
- Voice channels: the bot joins the channel, listens to users, transcribes and speaks replies aloud
- Voice channels require Connect + Speak + Use Voice Activity permissions
- Automatic echo prevention: the bot mutes its listener during TTS playback
Known Limitations
- WhatsApp: voice replies are sent as MP3 files (no native voice bubble)
- Signal: no streaming or message editing — voice replies are sent as attachments
- Local TTS (NeuTTS, KittenTTS, Piper) depends on your machine's CPU/GPU performance
Real-World Use Cases
Accessibility
Voice mode makes Hermes Agent accessible to users with keyboard difficulties. Whether for navigation, writing, or reviewing results, voice offers a natural and efficient alternative.
Mobility
On the go, Telegram voice messages let you interact with the agent without typing. Perfect for requesting a meeting summary, checking quick information, or delegating a task to the agent during a commute.
Tutorials and Podcasts
Hermes can transform an article into an audio file using TTS, create voice summaries of documents, or serve as a base for audio content. Combine this with the Skills system to automate voice content production.
Meetings and Note-Taking
With continuous voice recognition in the CLI, you can dictate notes, have the agent structure them, and get a clean summary — all by voice.
Complete Reference Configuration
Example ~/.hermes/config.yaml for voice mode:
voice:
  record_key: "ctrl+b"
  max_recording_seconds: 120
  auto_tts: false
  beep_enabled: true
  silence_threshold: 200
  silence_duration: 3.0
stt:
  provider: "local"
  local:
    model: "base"
tts:
  provider: "edge"
  speed: 1.0
  edge:
    voice: "en-US-AriaNeural"
    speed: 1.0
API keys in ~/.hermes/.env:
GROQ_API_KEY=your-key
VOICE_TOOLS_OPENAI_KEY=your-key
ELEVENLABS_API_KEY=your-key
Common Troubleshooting
"No audio device found" (CLI)
PortAudio is not installed:
brew install portaudio # macOS
sudo apt install portaudio19-dev # Ubuntu
Discord bot doesn't respond in server channels
The bot requires an @mention by default in server channels. Select the bot user (with the #discriminator), not the role with the same name. Or use DMs. You can also disable the requirement:
DISCORD_REQUIRE_MENTION=false
Discord bot can't hear me in voice channel
- Verify your Discord user ID is in DISCORD_ALLOWED_USERS
- Make sure you are not muted
- The bot needs a SPEAKING event from Discord, so speak within a few seconds of joining
Whisper returns garbage text
The hallucination filter catches most cases. If issues persist:
- Use a quieter environment
- Increase silence_threshold in config (higher = less sensitive)
- Try a different STT model (switch from base to small or large-v3)
TTS produces no audio
- Check the TTS provider API key and quota
- Edge TTS (free, no key) is the default fallback
- Check logs: tail -f ~/.hermes/logs/gateway.log
Telegram voice messages appear as files, not bubbles
Install ffmpeg:
sudo apt install ffmpeg
Or switch to a TTS provider that produces native Opus (OpenAI, ElevenLabs, Mistral).
Conclusion
Hermes Agent voice mode radically transforms how you interact with your AI assistant. With simple installation (pip install "hermes-agent[voice]"), free providers like Edge TTS and faster-whisper, and broad compatibility across CLI, Telegram, and Discord, it has never been easier to talk to your assistant.
The configuration flexibility — from STT/TTS provider selection to silence detection parameters — adapts voice mode to every context: accessibility, mobility, audio content production, or simply the convenience of oral conversation. To extend this setup, explore the multi-platform gateway that connects Hermes to all your messaging channels.