Hermes Agent Voice Mode: TTS, STT, and Oral Conversations

Hermes Agent 🟡 Intermediate ⏱️ 13 min read 📅 2026-05-05

Introduction

Hermes Agent's voice mode lets you speak to the agent and hear its replies, turning a text-based session into a two-way spoken conversation. It works at the CLI, on Telegram, and in Discord voice channels.

This article covers installation, configuration, and practical usage of Hermes Agent voice mode: speech recognition (STT), text-to-speech (TTS), and real-world use cases. If you haven't installed Hermes Agent yet, start with the complete installation guide.

Installing Voice Mode

System Dependencies

Before enabling voice, make sure you have the audio dependencies installed:

  • PortAudio — for microphone capture and audio playback in the CLI
  • ffmpeg — for audio format conversion (MP3 to Opus, PCM to WAV)
  • opus — audio codec required for Discord
  • espeak-ng — phonemizer for optional local TTS (NeuTTS)

# macOS
brew install portaudio ffmpeg opus espeak-ng

# Ubuntu / Debian
sudo apt install portaudio19-dev ffmpeg libopus0 espeak-ng

# Fedora
sudo dnf install portaudio-devel ffmpeg opus espeak-ng

Python Installation

Hermes Agent provides a dedicated voice extra that installs sounddevice and numpy:

pip install "hermes-agent[voice]"

For messaging platforms (Telegram, Discord), the messaging extra is sufficient: it includes discord.py[voice] and python-telegram-bot.

pip install "hermes-agent[messaging]"

To install everything at once:

pip install "hermes-agent[all]"

Option: Local Whisper

For fully local speech recognition with zero API keys, install faster-whisper:

pip install faster-whisper

The model (approximately 150 MB for base) downloads automatically on first use.

Activating Voice Mode in the CLI

Launch and Commands

Start the CLI and enable voice mode:

hermes              # Start the interactive CLI

Inside the CLI, use these slash commands:

  • /voice — toggle voice mode on/off
  • /voice on — enable voice mode (voice reply only when you send a voice message)
  • /voice tts — enable speech synthesis for all messages (text + voice)
  • /voice off — disable voice replies
  • /voice status — show current state

Recording with Ctrl+B

Here is how a voice conversation unfolds in the CLI:

  1. Enable voice mode with /voice on
  2. Press Ctrl+B — a beep plays (880 Hz), recording starts
  3. Speak — a live audio level bar shows your input: ● [▁▂▃▅▇▇▅▂] ❯
  4. Stop speaking — after 3 seconds of silence, recording auto-stops
  5. Two beeps (660 Hz) confirm the end of recording
  6. Audio is transcribed via Whisper and sent to the agent
  7. If TTS is enabled, the agent's reply is spoken aloud
  8. Recording automatically restarts — speak again without pressing any key

This loop continues until you press Ctrl+B during recording (exits continuous mode) or 3 consecutive recordings detect no speech.
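The two exit conditions of the continuous loop can be sketched as a small control function. This is only an illustration of the behavior described above, not Hermes's actual code: `continuous_loop`, the `CTRL_B` sentinel, and the returned reason strings are hypothetical names.

```python
from typing import Callable, Iterable, Optional

MAX_EMPTY_RECORDINGS = 3  # from the text: 3 consecutive recordings with no speech
CTRL_B = object()         # hypothetical sentinel: record key pressed mid-recording


def continuous_loop(recordings: Iterable[Optional[object]],
                    handle: Callable[[str], None]) -> str:
    """Consume recording results until one of the exit conditions is met.

    Each item is a transcript string, None or "" for a recording that
    contained no speech, or CTRL_B. Returns the reason the loop ended.
    """
    empty_streak = 0
    for result in recordings:
        if result is CTRL_B:
            return "interrupted"          # user left continuous mode
        if not result:
            empty_streak += 1
            if empty_streak >= MAX_EMPTY_RECORDINGS:
                return "no_speech"        # three silent recordings in a row
            continue
        empty_streak = 0                  # any real speech resets the streak
        handle(str(result))               # send the transcript to the agent
    return "exhausted"                    # input ended (only happens in tests)
```

Resetting the streak on every successful recording is what makes the rule "3 consecutive" rather than "3 in total".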

Tip: the record key is configurable via voice.record_key in ~/.hermes/config.yaml (default: ctrl+b).

Silence Detection

The detection algorithm works in two stages:

  1. Speech confirmation — waits for audio above the RMS threshold (200) for at least 0.3 seconds, tolerating brief dips between syllables
  2. End detection — once speech is confirmed, triggers after 3 seconds of continuous silence

If no speech is detected for 15 seconds, recording stops automatically. The RMS threshold and the end-of-speech silence window are configurable through voice.silence_threshold and voice.silence_duration in ~/.hermes/config.yaml. Beeps can be disabled with voice.beep_enabled: false.
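The two stages can be sketched as a small state machine over fixed-size audio frames. This is a minimal illustration under stated assumptions, not the actual Hermes implementation: the function names, the integer-millisecond bookkeeping, and the 200 ms dip tolerance are hypothetical; only the threshold (200), the 0.3 s confirmation, the 3 s silence window, and the 15 s timeout come from the text.

```python
import math
from typing import Iterable, Sequence, Tuple


def rms(frame: Sequence[int]) -> float:
    """Root mean square of a frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_end(frames: Iterable[Sequence[int]], frame_ms: int,
               threshold: float = 200.0, min_speech_ms: int = 300,
               silence_ms: int = 3000, timeout_ms: int = 15000,
               dip_tolerance_ms: int = 200) -> Tuple[str, int]:
    """Return (reason, elapsed_ms); reason is 'speech_ended',
    'no_speech', or 'still_recording' if the frames ran out."""
    elapsed = loud = dip = quiet = 0
    confirmed = False
    for frame in frames:
        elapsed += frame_ms
        is_loud = rms(frame) >= threshold
        if not confirmed:
            # Stage 1: accumulate loud time, forgiving short dips.
            if is_loud:
                loud += frame_ms
                dip = 0
                confirmed = loud >= min_speech_ms
            else:
                dip += frame_ms
                if dip > dip_tolerance_ms:   # dip too long: start over
                    loud = 0
            if not confirmed and elapsed >= timeout_ms:
                return ("no_speech", elapsed)
        else:
            # Stage 2: wait for one continuous stretch of silence.
            quiet = 0 if is_loud else quiet + frame_ms
            if quiet >= silence_ms:
                return ("speech_ended", elapsed)
    return ("still_recording", elapsed)
```

Raising `threshold` makes the detector less sensitive to background noise, while lowering `silence_ms` ends recordings sooner after you stop talking.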

Streaming TTS

When TTS is enabled, the agent speaks its reply sentence by sentence as it generates — you don't wait for the full response. The system:

  1. Buffers text deltas into complete sentences (minimum 20 characters)
  2. Strips markdown formatting and code blocks
  3. Generates and plays audio per sentence in real time
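The buffering and stripping steps above might look roughly like the following generator. It is a sketch, not Hermes's code: the function names and the exact regular expressions are assumptions, and the audio generation step is left out. Only the 20-character minimum and the removal of markdown and code blocks come from the text.

```python
import re
from typing import Iterable, Iterator

MIN_SENTENCE_CHARS = 20                              # minimum from the text
_CODE_BLOCK = re.compile(r"```.*?```", re.DOTALL)    # complete fenced blocks
_INLINE_MD = re.compile(r"[*`]|^#+\s*", re.MULTILINE)
_SENTENCE_END = re.compile(r"[.!?]\s")


def strip_markdown(text: str) -> str:
    """Drop fenced code, emphasis markers, and heading hashes."""
    text = _CODE_BLOCK.sub(" ", text)
    text = _INLINE_MD.sub("", text)
    return " ".join(text.split())


def sentences(deltas: Iterable[str]) -> Iterator[str]:
    """Yield speakable sentences as streamed text deltas arrive."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        if buffer.count("```") % 2 == 1:
            continue  # inside an unfinished code block: wait for its end
        buffer = _CODE_BLOCK.sub(" ", buffer)
        while True:
            # First sentence boundary that yields a long enough chunk.
            cut = next((m.end() for m in _SENTENCE_END.finditer(buffer)
                        if len(buffer[:m.end()].strip()) >= MIN_SENTENCE_CHARS),
                       None)
            if cut is None:
                break
            chunk, buffer = buffer[:cut], buffer[cut:]
            spoken = strip_markdown(chunk)
            if spoken:
                yield spoken
    tail = strip_markdown(buffer)  # flush whatever is left at end of stream
    if tail:
        yield tail
```

Short fragments such as "Hi." are merged into the following sentence, so the TTS engine is not called for very short snippets.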

Whisper Hallucination Filter

Whisper sometimes generates phantom text from silence or background noise ("Thank you for watching", "Subscribe", etc.). Hermes automatically filters these artifacts using a set of 26 known hallucination phrases across multiple languages, plus a regex pattern that catches repetitive variations.
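A filter of this kind can be sketched with a phrase set plus a repetition regex. The example below is hypothetical: it contains only the two phrases quoted above rather than the full list of 26, and the regex is one plausible reading of "repetitive variations" (a unit repeated back to back), not the pattern Hermes actually ships.

```python
import re

# Illustrative subset only; the real list is not reproduced here.
KNOWN_PHRASES = {"thank you for watching", "subscribe"}

_PUNCT = re.compile(r"[^\w\s]")
# A unit followed by one or more copies of itself, e.g. "thank you thank you".
_REPEATED = re.compile(r"^(.+?)(?:\s+\1)+$")


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(_PUNCT.sub("", text).lower().split())


def should_discard(transcript: str) -> bool:
    """True if the transcript looks like a Whisper artifact."""
    cleaned = normalize(transcript)
    if not cleaned:
        return True                      # nothing left to send
    if cleaned in KNOWN_PHRASES:
        return True                      # exact known hallucination
    return _REPEATED.match(cleaned) is not None
```

Normalizing before matching means "Thank you for watching!" and "thank you for watching" are treated the same. The repetition check is deliberately aggressive and would also drop a genuine "bye bye", which is a trade-off a real filter has to tune.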

Speech Recognition (STT)

Hermes Agent supports multiple STT providers with automatic fallback:

Provider               | Model                  | Speed             | Quality | Cost      | API Key
Local (faster-whisper) | base                   | Fast (CPU/GPU)    | Good    | Free      | No
Local (faster-whisper) | small                  | Medium            | Better  | Free      | No
Local (faster-whisper) | large-v3               | Slow              | Best    | Free      | No
Groq                   | whisper-large-v3-turbo | Very fast (~0.5s) | Good    | Free tier | Yes
Groq                   | whisper-large-v3       | Fast (~1s)        | Better  | Free tier | Yes
OpenAI                 | whisper-1              | Fast (~1s)        | Good    | Paid      | Yes
OpenAI                 | gpt-4o-transcribe      | Medium (~2s)      | Best    | Paid      | Yes

Automatic fallback priority: local → groq → openai.
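The fallback order can be pictured as a simple availability check per provider. This sketch is an assumption about how such a chain might work, not Hermes's code: the function names are hypothetical, and it assumes the configured provider is tried first and that a cloud provider counts as available when its API key (GROQ_API_KEY or VOICE_TOOLS_OPENAI_KEY) is set.

```python
from typing import Mapping, Optional

FALLBACK_ORDER = ("local", "groq", "openai")  # priority from the text


def is_available(provider: str, env: Mapping[str, str],
                 local_installed: bool) -> bool:
    """Assumed availability rules: local needs faster-whisper, others a key."""
    if provider == "local":
        return local_installed
    if provider == "groq":
        return bool(env.get("GROQ_API_KEY"))
    if provider == "openai":
        return bool(env.get("VOICE_TOOLS_OPENAI_KEY"))
    return False


def pick_stt_provider(preferred: str, env: Mapping[str, str],
                      local_installed: bool) -> Optional[str]:
    """Try the preferred provider, then the rest in fallback order."""
    order = [preferred] + [p for p in FALLBACK_ORDER if p != preferred]
    for provider in order:
        if is_available(provider, env, local_installed):
            return provider
    return None  # no usable STT provider at all
```

In a real process you could pass `os.environ` as `env` and compute `local_installed` with `importlib.util.find_spec("faster_whisper") is not None`.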

STT Configuration

In ~/.hermes/config.yaml:

stt:
  provider: "local"
  local:
    model: "base"                # tiny, base, small, medium, large-v3

API keys in ~/.hermes/.env:

GROQ_API_KEY=your-key
VOICE_TOOLS_OPENAI_KEY=your-key

Model overrides:

STT_GROQ_MODEL=whisper-large-v3-turbo
STT_OPENAI_MODEL=whisper-1

Text-to-Speech (TTS)

Hermes Agent supports ten TTS providers, from free to premium:

Provider           | Quality   | Cost         | Latency  | API Key
Edge TTS (default) | Good      | Free         | ~1s      | No
ElevenLabs         | Excellent | Paid         | ~2s      | Yes
OpenAI TTS         | Good      | Paid         | ~1.5s    | Yes
MiniMax TTS        | Excellent | Paid         | ~1.5s    | Yes
Mistral (Voxtral)  | Excellent | Paid         | ~2s      | Yes
Google Gemini TTS  | Excellent | Free tier    | ~1s      | Yes
xAI TTS            | Excellent | Paid         | ~1s      | Yes
NeuTTS             | Good      | Free (local) | Variable | No
KittenTTS          | Good      | Free (local) | Variable | No
Piper              | Good      | Free (local) | Variable | No

TTS Configuration

tts:
  provider: "edge"
  speed: 1.0

  edge:
    voice: "en-US-AriaNeural"    # 322 voices, 74 languages
    speed: 1.0

  openai:
    model: "gpt-4o-mini-tts"
    voice: "alloy"               # alloy, echo, fable, onyx, nova, shimmer
    speed: 1.0

  elevenlabs:
    voice_id: "pNInz6obpgDQGcFmaJgB"
    model_id: "eleven_multilingual_v2"

  minimax:
    model: "speech-2.8-hd"
    voice_id: "English_Graceful_Lady"

  xai:
    voice_id: "eve"
    language: "en"

Speed control: each provider can override the global multiplier. The hierarchy is: provider-specific speed → global tts.speed → 1.0 default.
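The lookup order can be expressed in a few lines. The helper below is a hypothetical illustration operating on the `tts` section of the config as a plain dictionary; it is not taken from the Hermes source.

```python
from typing import Any, Mapping


def resolve_speed(tts_config: Mapping[str, Any], provider: str) -> float:
    """Provider-specific speed, else global tts.speed, else 1.0."""
    provider_cfg = tts_config.get(provider) or {}
    for candidate in (provider_cfg.get("speed"), tts_config.get("speed")):
        if candidate is not None:
            return float(candidate)
    return 1.0


# Example: edge overrides the global value, openai inherits it.
example = {"speed": 1.2, "edge": {"voice": "en-US-AriaNeural", "speed": 0.9},
           "openai": {"voice": "alloy"}}
```

With the `example` dictionary, `resolve_speed(example, "edge")` uses the provider override, `resolve_speed(example, "openai")` falls back to the global multiplier, and an empty config yields the 1.0 default.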

Custom Command Providers

You can integrate any external TTS engine (VoxCPM, MLX-Kokoro, XTTS CLI) via a command-type provider in config.yaml. Hermes writes text to a temp file, runs your command, and reads the audio output — no Python needed.
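The write, run, read cycle described above can be sketched with the standard library. Everything named here is an assumption for illustration: the `{input}` and `{output}` placeholders, the function name, and the stand-in command are hypothetical and do not reflect Hermes's actual config schema.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Sequence


def run_command_tts(text: str, command: Sequence[str],
                    suffix: str = ".wav", timeout: float = 60.0) -> bytes:
    """Write text to a temp file, run the external command, return audio bytes.

    Hypothetical placeholders {input} and {output} in the command are
    replaced with the temp file paths.
    """
    with tempfile.TemporaryDirectory() as tmp:
        text_path = Path(tmp) / "input.txt"
        audio_path = Path(tmp) / f"output{suffix}"
        text_path.write_text(text, encoding="utf-8")
        argv = [part.replace("{input}", str(text_path))
                    .replace("{output}", str(audio_path))
                for part in command]
        # check=True raises CalledProcessError if the engine exits non-zero.
        subprocess.run(argv, check=True, timeout=timeout)
        return audio_path.read_bytes()


# Stand-in "engine" for demonstration: copies the text file to the output path.
DEMO_COMMAND = [sys.executable, "-c",
                "import shutil, sys; shutil.copyfile(sys.argv[1], sys.argv[2])",
                "{input}", "{output}"]
```

Passing the command as an argument list rather than a shell string avoids quoting problems when the text or paths contain spaces.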

Telegram Voice Messages

If you have already connected Telegram to Hermes Agent, voice messages work immediately with no additional configuration.

Receiving Voice Messages

Send a voice message to your Telegram bot — Hermes transcribes it automatically via Whisper and injects the transcript as text into the conversation. The agent sees the transcript as a normal message.

Sending Voice Replies

Enable voice mode in Telegram:

  • /voice on — voice reply only when you send a voice message
  • /voice tts — voice reply for all messages
  • /voice off — text only mode (default)
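The three modes reduce to a small decision rule. The function below is a hypothetical sketch of that rule, not Hermes's code.

```python
def reply_as_voice(mode: str, incoming_is_voice: bool) -> bool:
    """Decide whether the agent's reply should be sent as audio."""
    if mode == "tts":
        return True                 # every reply is spoken
    if mode == "on":
        return incoming_is_voice    # mirror the user's input type
    if mode == "off":
        return False                # text only (default)
    raise ValueError(f"unknown voice mode: {mode!r}")
```

In "on" mode a typed message therefore still gets a text reply, which keeps voice output from surprising you in a quiet environment.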

Delivery format: voice replies are sent as native Opus/OGG voice bubbles that play inline in chat. If ffmpeg is not installed, MP3-producing providers send a regular audio file instead.

ffmpeg Note

Some providers produce Opus natively (OpenAI, ElevenLabs, Mistral) — no conversion needed. Others like Edge TTS, MiniMax, xAI, NeuTTS, KittenTTS and Piper require ffmpeg for conversion to the Opus/OGG format expected by Telegram.
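The delivery decision and the conversion step can be sketched as follows. The provider keys and function names are assumptions ("mistral" in particular does not appear as a config key in this article); the ffmpeg flags are standard options for encoding Opus audio into an OGG container, though the exact arguments Hermes uses may differ.

```python
from typing import List

# Providers the text lists as producing Opus natively (keys are assumed).
NATIVE_OPUS = {"openai", "elevenlabs", "mistral"}


def delivery_format(provider: str, ffmpeg_available: bool) -> str:
    """Return 'voice_bubble' or 'audio_file' for a Telegram reply."""
    if provider in NATIVE_OPUS:
        return "voice_bubble"       # no conversion needed
    if ffmpeg_available:
        return "voice_bubble"       # convert to Opus/OGG first
    return "audio_file"             # fall back to a regular audio file


def opus_command(src: str, dst: str, bitrate: str = "64k") -> List[str]:
    """Build an ffmpeg command that re-encodes src as Opus in dst (.ogg)."""
    return ["ffmpeg", "-y", "-i", src, "-c:a", "libopus", "-b:a", bitrate, dst]
```

To check for ffmpeg at runtime, `shutil.which("ffmpeg") is not None` is enough; the command list can then be passed to `subprocess.run(..., check=True)`.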

Tip: if you don't want to install ffmpeg, switch to a provider that produces native Opus (OpenAI, ElevenLabs, or Mistral).

Platform Compatibility

Voice mode works differently across platforms:

CLI

Full voice interaction with Ctrl+B for recording and real-time audio playback. Works in both the classic CLI (hermes chat) and the TUI (hermes --tui). See the CLI mastery guide for more details.

Telegram

  • Automatic send and receive of voice messages
  • Native Opus/OGG voice bubbles
  • Automatic STT transcription of received voice messages
  • TTS replies sent as audio

Discord

  • Voice messages in DMs and text channels (with @mention)
  • Voice channels: the bot joins the channel, listens to users, transcribes and speaks replies aloud
  • Voice channels require Connect + Speak + Use Voice Activity permissions
  • Automatic echo prevention: the bot mutes its listener during TTS playback

Known Limitations

  • WhatsApp: voice replies are sent as MP3 files (no native voice bubble)
  • Signal: no streaming or message editing — voice replies are sent as attachments
  • Local TTS (NeuTTS, KittenTTS, Piper) depends on your machine's CPU/GPU performance

Real-World Use Cases

Accessibility

Voice mode makes Hermes Agent accessible to users who have difficulty using a keyboard. Whether for navigation, writing, or reviewing results, voice offers a natural and efficient alternative.

Mobility

On the go, Telegram voice messages let you interact with the agent without typing. This is useful for requesting a meeting summary, looking up a quick fact, or delegating a task to the agent during a commute.

Tutorials and Podcasts

Hermes can transform an article into an audio file using TTS, create voice summaries of documents, or serve as a base for audio content. Combine this with the Skills system to automate voice content production.

Meetings and Note-Taking

With continuous voice recognition in the CLI, you can dictate notes, have the agent structure them, and get a clean summary — all by voice.

Complete Reference Configuration

Example ~/.hermes/config.yaml for voice mode:

voice:
  record_key: "ctrl+b"
  max_recording_seconds: 120
  auto_tts: false
  beep_enabled: true
  silence_threshold: 200
  silence_duration: 3.0

stt:
  provider: "local"
  local:
    model: "base"

tts:
  provider: "edge"
  speed: 1.0
  edge:
    voice: "en-US-AriaNeural"
    speed: 1.0

API keys in ~/.hermes/.env:

GROQ_API_KEY=your-key
VOICE_TOOLS_OPENAI_KEY=your-key
ELEVENLABS_API_KEY=your-key

Common Troubleshooting

"No audio device found" (CLI)

PortAudio is not installed:

brew install portaudio              # macOS
sudo apt install portaudio19-dev    # Ubuntu

Discord bot doesn't respond in server channels

The bot requires an @mention by default in server channels. When mentioning it, select the bot user (shown with the #discriminator), not the role with the same name. Alternatively, message the bot in DMs, or disable the mention requirement:

DISCORD_REQUIRE_MENTION=false

Discord bot can't hear me in voice channel

  • Verify your Discord user ID is in DISCORD_ALLOWED_USERS
  • Make sure you are not muted
  • The bot needs a SPEAKING event from Discord — speak within a few seconds of joining

Whisper returns garbage text

The hallucination filter catches most cases. If issues persist:

  • Use a quieter environment
  • Increase silence_threshold in config (higher = less sensitive)
  • Try a different STT model (switch from base to small or large-v3)

TTS produces no audio

  • Check the TTS provider API key and quota
  • Edge TTS (free, no key) is the default fallback
  • Check logs: tail -f ~/.hermes/logs/gateway.log

Telegram voice messages appear as files, not bubbles

Install ffmpeg:

sudo apt install ffmpeg

Or switch to a TTS provider that produces native Opus (OpenAI, ElevenLabs, Mistral).

Conclusion

Hermes Agent voice mode lets you talk to your assistant instead of typing. Installation is a single command (pip install "hermes-agent[voice]"), free providers such as Edge TTS and faster-whisper cover speech synthesis and recognition without API keys, and the feature works across the CLI, Telegram, and Discord.

The configuration flexibility — from STT/TTS provider selection to silence detection parameters — adapts voice mode to every context: accessibility, mobility, audio content production, or simply the convenience of oral conversation. To extend this setup, explore the multi-platform gateway that connects Hermes to all your messaging channels.