Hermes Agent Voice Mode: TTS, STT, and Oral Conversations

Hermes Agent 🟡 Intermediate ⏱️ 13 min read 📅 2026-05-05

Introduction

Talking to an AI agent with your voice, hearing its replies, having a natural conversation — this is no longer science fiction. Hermes Agent's voice mode makes bidirectional oral conversation a reality. Whether you are at the CLI, on Telegram, or in a Discord voice channel, Hermes turns text-based interaction into a genuine spoken dialogue.

This article covers installation, configuration, and practical usage of Hermes Agent voice mode: speech recognition (STT), text-to-speech (TTS), and real-world use cases. If you haven't installed Hermes Agent yet, start with the complete installation guide.

Installing Voice Mode

System Dependencies

Before enabling voice, make sure you have the audio dependencies installed:

PortAudio — for microphone capture and audio playback in the CLI
ffmpeg — for audio format conversion (MP3 to Opus, PCM to WAV)
opus — audio codec required for Discord
espeak-ng — phonemizer for optional local TTS (NeuTTS)

# macOS
brew install portaudio ffmpeg opus espeak-ng

# Ubuntu / Debian
sudo apt install portaudio19-dev ffmpeg libopus0 espeak-ng

# Fedora
sudo dnf install portaudio-devel ffmpeg opus espeak-ng

Python Installation

Hermes Agent provides a dedicated voice extra that installs sounddevice and numpy:

pip install "hermes-agent[voice]"

If you use messaging (Telegram, Discord), the messaging extra is sufficient — it includes discord.py[voice] and python-telegram-bot:

pip install "hermes-agent[messaging]"

To install everything at once:

pip install "hermes-agent[all]"

Option: Local Whisper

For fully local speech recognition with zero API keys, install faster-whisper:

pip install faster-whisper

The model (approximately 150 MB for base) downloads automatically on first use.

Activating Voice Mode in the CLI

Launch and Commands

Start the CLI, then enable voice mode from inside it:

hermes              # Start the interactive CLI

Inside the CLI, use these slash commands:

/voice — toggle voice mode on/off
/voice on — enable voice mode (voice reply only when you send a voice message)
/voice tts — enable speech synthesis for all messages (text + voice)
/voice off — disable voice replies
/voice status — show current state

Recording with Ctrl+B

Here is how a voice conversation unfolds in the CLI:

Enable voice mode with /voice on
Press Ctrl+B — a beep plays (880 Hz), recording starts
Speak — a live audio level bar shows your input: ● [▁▂▃▅▇▇▅▂] ❯
Stop speaking — after 3 seconds of silence, recording auto-stops
Two beeps (660 Hz) confirm the end of recording
Audio is transcribed via Whisper and sent to the agent
If TTS is enabled, the agent's reply is spoken aloud
Recording automatically restarts — speak again without pressing any key

The loop continues until you press Ctrl+B during a recording (which exits continuous mode) or until 3 consecutive recordings contain no speech.
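The exit condition of this loop can be sketched in a few lines. This is a minimal illustration, not Hermes' actual code: record_once and handle are hypothetical callables standing in for the recorder and the agent call, and the Ctrl+B interruption is left out.

```python
from typing import Callable, Optional

MAX_EMPTY_RECORDINGS = 3  # from the text: 3 consecutive recordings without speech


def continuous_voice_loop(
    record_once: Callable[[], Optional[str]],
    handle: Callable[[str], None],
    max_empty: int = MAX_EMPTY_RECORDINGS,
) -> int:
    """Run record/handle cycles until `max_empty` recordings in a row are empty.

    `record_once` (hypothetical) returns a transcript, or None when no speech
    was detected. Returns the number of transcripts that were handled.
    """
    handled = 0
    empty_streak = 0
    while empty_streak < max_empty:
        transcript = record_once()
        if not transcript:
            empty_streak += 1
            continue
        empty_streak = 0  # any real speech resets the counter
        handle(transcript)
        handled += 1
    return handled
```

Resetting the counter on every successful recording is what makes the limit apply to consecutive empty recordings rather than to the total.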

Tip: the record key is configurable via voice.record_key in ~/.hermes/config.yaml (default: ctrl+b).

Silence Detection

The detection algorithm works in two stages:

Speech confirmation — waits for audio above the RMS threshold (200) for at least 0.3 seconds, tolerating brief dips between syllables
End detection — once speech is confirmed, triggers after 3 seconds of continuous silence

If no speech is detected for 15 seconds, recording stops automatically. The threshold and the silence duration are configurable via voice.silence_threshold and voice.silence_duration in ~/.hermes/config.yaml. Beeps can be disabled with voice.beep_enabled: false.
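To make the two stages concrete, here is a rough sketch that works on a stream of per-frame RMS values. It is an illustration under assumptions, not the Hermes implementation: the 100 ms frame length and the 100 ms dip tolerance are invented for the example, while the threshold, the 0.3 s confirmation, the 3 s silence and the 15 s timeout come from the text.

```python
from typing import Iterable, Tuple


def detect_recording_end(
    frame_rms: Iterable[float],
    frame_ms: int = 100,             # assumed frame length
    threshold: float = 200.0,        # voice.silence_threshold
    min_speech_ms: int = 300,        # 0.3 s of speech confirms the user is talking
    silence_ms: int = 3000,          # voice.silence_duration (3.0 s)
    no_speech_timeout_ms: int = 15000,
    dip_tolerance_ms: int = 100,     # assumed: quiet gap allowed between syllables
) -> Tuple[str, int]:
    """Return (reason, elapsed_ms) for a stream of per-frame RMS levels."""
    elapsed = speech_run = dip_run = silence_run = 0
    confirmed = False
    for rms in frame_rms:
        elapsed += frame_ms
        loud = rms > threshold
        if confirmed:
            # Stage 2: wait for continuous silence after confirmed speech.
            silence_run = 0 if loud else silence_run + frame_ms
            if silence_run >= silence_ms:
                return ("silence", elapsed)
            continue
        # Stage 1: confirm speech, tolerating brief dips.
        if loud:
            speech_run += frame_ms
            dip_run = 0
            confirmed = speech_run >= min_speech_ms
        else:
            dip_run += frame_ms
            if dip_run > dip_tolerance_ms:
                speech_run = 0  # the gap was too long, start confirming again
        if not confirmed and elapsed >= no_speech_timeout_ms:
            return ("no_speech_timeout", elapsed)
    return ("stream_ended", elapsed)
```

Counting time in integer milliseconds avoids floating-point drift when many short frames are summed.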

Streaming TTS

When TTS is enabled, the agent speaks its reply sentence by sentence as it generates — you don't wait for the full response. The system:

Buffers text deltas into complete sentences (minimum 20 characters)
Strips markdown formatting and code blocks
Generates and plays audio per sentence in real time
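The buffering step above might look roughly like the following generator. Treat it as a sketch under assumptions: the sentence-boundary regex and the markdown stripping are simplified guesses, and only the 20-character minimum comes from the text. Each yielded string would then be handed to the TTS provider.

```python
import re
from typing import Iterable, Iterator

MIN_SENTENCE_CHARS = 20   # minimum sentence length from the text
FENCE = "`" * 3           # markdown code fence marker

_CODE_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
_INLINE_MARKUP = re.compile(r"[*`#]+")      # rough approximation of markdown markers
_SENTENCE_END = re.compile(r"[.!?]+\s+")    # punctuation followed by whitespace


def _clean(text: str) -> str:
    """Remove inline markers and collapse whitespace."""
    text = _INLINE_MARKUP.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def sentences_from_deltas(deltas: Iterable[str]) -> Iterator[str]:
    """Yield speakable sentences as soon as the streamed text completes them."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        if buffer.count(FENCE) % 2:
            continue  # inside an unfinished code block: keep buffering
        buffer = _CODE_BLOCK.sub(" ", buffer)
        start = 0
        for match in _SENTENCE_END.finditer(buffer):
            candidate = _clean(buffer[start:match.end()])
            if len(candidate) >= MIN_SENTENCE_CHARS:
                yield candidate
                start = match.end()
            # Shorter fragments stay in the buffer and merge with the next sentence.
        buffer = buffer[start:]
    tail = _clean(_CODE_BLOCK.sub(" ", buffer))
    if tail:
        yield tail  # flush whatever is left when the stream ends
```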

Whisper Hallucination Filter

Whisper sometimes generates phantom text from silence or background noise ("Thank you for watching", "Subscribe", etc.). Hermes automatically filters these artifacts using a set of 26 known hallucination phrases across multiple languages, plus a regex pattern that catches repetitive variations.
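A filter of this kind can be approximated as follows. The phrase set is a small illustrative subset (the actual 26 phrases are not listed in this article), and both the regex and the choice to drop empty transcripts are assumptions about how the repetition check might work.

```python
import re

# Illustrative subset only; the real list is not reproduced here.
KNOWN_HALLUCINATIONS = {
    "thank you",
    "thank you for watching",
    "thanks for watching",
    "subscribe",
}

# Assumed pattern: one phrase repeated back to back, e.g. "Thank you. Thank you."
_REPEATED = re.compile(r"^(.+?)(?: \1)+$")


def _normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def is_hallucination(transcript: str) -> bool:
    """Return True when the transcript looks like a known Whisper artifact."""
    norm = _normalize(transcript)
    if not norm:
        return True  # assumption: empty transcripts are dropped as well
    if norm in KNOWN_HALLUCINATIONS:
        return True
    match = _REPEATED.match(norm)
    return bool(match and match.group(1) in KNOWN_HALLUCINATIONS)
```

Requiring the repeated unit to be a known phrase keeps legitimate repetitions such as "no no" from being discarded.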

Speech Recognition (STT)

Hermes Agent supports multiple STT providers with automatic fallback:

Provider	Model	Speed	Quality	Cost	API Key
Local (faster-whisper)	`base`	Fast (CPU/GPU)	Good	Free	No
Local (faster-whisper)	`small`	Medium	Better	Free	No
Local (faster-whisper)	`large-v3`	Slow	Best	Free	No
Groq	`whisper-large-v3-turbo`	Very fast (~0.5s)	Good	Free tier	Yes
Groq	`whisper-large-v3`	Fast (~1s)	Better	Free tier	Yes
OpenAI	`whisper-1`	Fast (~1s)	Good	Paid	Yes
OpenAI	`gpt-4o-transcribe`	Medium (~2s)	Best	Paid	Yes

Automatic fallback priority: local → groq → openai.
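One plausible reading of this fallback is sketched below. The function names are hypothetical and the availability checks are assumptions: local counts as available when faster-whisper can be imported, and the cloud providers when their key (GROQ_API_KEY or VOICE_TOOLS_OPENAI_KEY, as used in the configuration in this article) is set.

```python
import importlib.util
from typing import Mapping, Optional

FALLBACK_ORDER = ("local", "groq", "openai")  # priority from the text


def local_whisper_available() -> bool:
    """True when the faster-whisper package can be imported."""
    return importlib.util.find_spec("faster_whisper") is not None


def pick_stt_provider(
    preferred: str,
    env: Mapping[str, str],
    local_available: bool,
) -> Optional[str]:
    """Return the preferred provider if usable, else the first usable fallback."""
    available = {
        "local": local_available,
        "groq": bool(env.get("GROQ_API_KEY")),
        "openai": bool(env.get("VOICE_TOOLS_OPENAI_KEY")),
    }
    if available.get(preferred):
        return preferred
    for name in FALLBACK_ORDER:
        if available[name]:
            return name
    return None  # nothing usable: the caller should report a configuration error
```

In practice you would call it as pick_stt_provider("local", os.environ, local_whisper_available()).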

STT Configuration

In ~/.hermes/config.yaml:

stt:
  provider: "local"
  local:
    model: "base"                # tiny, base, small, medium, large-v3

API keys in ~/.hermes/.env:

GROQ_API_KEY=your-key
VOICE_TOOLS_OPENAI_KEY=your-key

Model overrides:

STT_GROQ_MODEL=whisper-large-v3-turbo
STT_OPENAI_MODEL=whisper-1

Text-to-Speech (TTS)

Hermes Agent supports ten TTS providers, from free to premium:

Provider	Quality	Cost	Latency	API Key
Edge TTS (default)	Good	Free	~1s	No
ElevenLabs	Excellent	Paid	~2s	Yes
OpenAI TTS	Good	Paid	~1.5s	Yes
MiniMax TTS	Excellent	Paid	~1.5s	Yes
Mistral (Voxtral)	Excellent	Paid	~2s	Yes
Google Gemini TTS	Excellent	Free tier	~1s	Yes
xAI TTS	Excellent	Paid	~1s	Yes
NeuTTS	Good	Free (local)	Variable	No
KittenTTS	Good	Free (local)	Variable	No
Piper	Good	Free (local)	Variable	No

TTS Configuration

tts:
  provider: "edge"
  speed: 1.0

  edge:
    voice: "en-US-AriaNeural"    # 322 voices, 74 languages
    speed: 1.0

  openai:
    model: "gpt-4o-mini-tts"
    voice: "alloy"               # alloy, echo, fable, onyx, nova, shimmer
    speed: 1.0

  elevenlabs:
    voice_id: "pNInz6obpgDQGcFmaJgB"
    model_id: "eleven_multilingual_v2"

  minimax:
    model: "speech-2.8-hd"
    voice_id: "English_Graceful_Lady"

  xai:
    voice_id: "eve"
    language: "en"

Speed control: each provider can override the global multiplier. The hierarchy is: provider-specific speed → global tts.speed → 1.0 default.
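The lookup order can be written out as a small helper. The function name is made up for this example, and the dictionary mirrors the tts section of config.yaml shown above.

```python
from typing import Any, Mapping


def resolve_tts_speed(tts_config: Mapping[str, Any], provider: str) -> float:
    """Provider-specific speed, then global tts.speed, then 1.0."""
    provider_cfg = tts_config.get(provider) or {}
    for value in (provider_cfg.get("speed"), tts_config.get("speed")):
        if value is not None:
            return float(value)
    return 1.0
```

For example, with a global speed of 1.2 and edge.speed set to 0.9, Edge TTS would run at 0.9 while a provider without its own speed key would inherit 1.2.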

Custom Command Providers

You can integrate any external TTS engine (VoxCPM, MLX-Kokoro, XTTS CLI) via a command-type provider in config.yaml. Hermes writes text to a temp file, runs your command, and reads the audio output — no Python needed.
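The flow described here (text to a temp file, run the command, read the audio back) can be pictured with the sketch below. It is not the Hermes implementation: the {input} and {output} placeholders, the file names, and the function itself are assumptions made for illustration, so check the Hermes documentation for the actual configuration keys.

```python
import shlex
import subprocess
import tempfile
from pathlib import Path


def synthesize_with_command(text: str, command_template: str, timeout: float = 60.0) -> bytes:
    """Run an external TTS command and return the audio bytes it produced.

    `command_template` uses the hypothetical placeholders {input} and {output}.
    """
    with tempfile.TemporaryDirectory() as workdir:
        input_path = Path(workdir) / "input.txt"
        output_path = Path(workdir) / "output.wav"  # assumed output name
        input_path.write_text(text, encoding="utf-8")
        # Split first, then substitute, so paths containing spaces stay one argument.
        command = [
            part.format(input=str(input_path), output=str(output_path))
            for part in shlex.split(command_template)
        ]
        subprocess.run(command, check=True, timeout=timeout)
        return output_path.read_bytes()
```

Passing the command as an argument list rather than through a shell means the text and file paths are never interpreted as shell syntax.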

Telegram Voice Messages

If you have already connected Telegram to Hermes Agent, voice messages work immediately with no additional configuration.

Receiving Voice Messages

Send a voice message to your Telegram bot — Hermes transcribes it automatically via Whisper and injects the transcript as text into the conversation. The agent sees the transcript as a normal message.

Sending Voice Replies

Enable voice mode in Telegram:

/voice on — voice reply only when you send a voice message
/voice tts — voice reply for all messages
/voice off — text only mode (default)

Delivery format: voice replies are sent as native Opus/OGG voice bubbles that play inline in chat. If ffmpeg is not installed, MP3-producing providers send a regular audio file instead.

ffmpeg Note

Some providers produce Opus natively (OpenAI, ElevenLabs, Mistral) — no conversion needed. Others like Edge TTS, MiniMax, xAI, NeuTTS, KittenTTS and Piper require ffmpeg for conversion to the Opus/OGG format expected by Telegram.
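As an illustration of that decision, the sketch below picks a delivery mode and builds the ffmpeg command for the conversion. The helper names, the 64k bitrate, and the rule that .ogg and .opus files are already Opus are assumptions. The ffmpeg options themselves (-i, -c:a libopus, -b:a) are standard.

```python
import shutil
from pathlib import Path
from typing import List, Optional, Tuple


def opus_conversion_command(src: str, dst: str, bitrate: str = "64k") -> List[str]:
    """Build an ffmpeg command that re-encodes `src` to Opus in an OGG container."""
    return ["ffmpeg", "-y", "-i", src, "-c:a", "libopus", "-b:a", bitrate, dst]


def plan_telegram_delivery(
    audio_path: str, has_ffmpeg: Optional[bool] = None
) -> Tuple[str, Optional[List[str]]]:
    """Return (delivery_kind, conversion_command_or_None) for a TTS output file."""
    if has_ffmpeg is None:
        has_ffmpeg = shutil.which("ffmpeg") is not None
    path = Path(audio_path)
    if path.suffix.lower() in {".ogg", ".opus"}:
        return ("voice_bubble", None)  # assumed to be Opus already
    if has_ffmpeg:
        target = str(path.with_suffix(".ogg"))
        return ("voice_bubble", opus_conversion_command(str(path), target))
    return ("audio_file", None)  # no ffmpeg: fall back to a regular audio file
```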

Tip: if you don't want to install ffmpeg, switch to OpenAI or ElevenLabs which produce native Opus.

Platform Compatibility

Voice mode works differently across platforms:

CLI

Full voice interaction with Ctrl+B for recording and real-time audio playback. Works in both the classic CLI (hermes chat) and the TUI (hermes --tui). See the CLI mastery guide for more details.

Telegram

Automatic send and receive of voice messages
Native Opus/OGG voice bubbles
Automatic STT transcription of received voice messages
TTS replies sent as audio

Discord

Voice messages in DMs and text channels (with @mention)
Voice channels: the bot joins the channel, listens to users, transcribes and speaks replies aloud
Voice channels require Connect + Speak + Use Voice Activity permissions
Automatic echo prevention: the bot mutes its listener during TTS playback

Known Limitations

WhatsApp: voice replies are sent as MP3 files (no native voice bubble)
Signal: no streaming or message editing — voice replies are sent as attachments
Local TTS (NeuTTS, KittenTTS, Piper) depends on your machine's CPU/GPU performance

Real-World Use Cases

Accessibility

Voice mode makes Hermes Agent accessible to users who have difficulty using a keyboard. Whether for navigation, writing, or reviewing results, voice offers a natural and efficient alternative.

Mobility

On the go, Telegram voice messages let you interact with the agent without typing. Perfect for requesting a meeting summary, checking quick information, or delegating a task to the agent during a commute.

Tutorials and Podcasts

Hermes can transform an article into an audio file using TTS, create voice summaries of documents, or serve as a base for audio content. Combine this with the Skills system to automate voice content production.

Meetings and Note-Taking

With continuous voice recognition in the CLI, you can dictate notes, have the agent structure them, and get a clean summary — all by voice.

Complete Reference Configuration

Example ~/.hermes/config.yaml for voice mode:

voice:
  record_key: "ctrl+b"
  max_recording_seconds: 120
  auto_tts: false
  beep_enabled: true
  silence_threshold: 200
  silence_duration: 3.0

stt:
  provider: "local"
  local:
    model: "base"

tts:
  provider: "edge"
  speed: 1.0
  edge:
    voice: "en-US-AriaNeural"
    speed: 1.0

API keys in ~/.hermes/.env:

GROQ_API_KEY=your-key
VOICE_TOOLS_OPENAI_KEY=your-key
ELEVENLABS_API_KEY=your-key

Common Troubleshooting

"No audio device found" (CLI)

This usually means PortAudio is not installed:

brew install portaudio              # macOS
sudo apt install portaudio19-dev    # Ubuntu

Discord bot doesn't respond in server channels

The bot requires an @mention by default in server channels. When mentioning it, pick the bot user (shown with the #discriminator), not the role of the same name. Alternatively, message the bot in a DM, or disable the mention requirement:

DISCORD_REQUIRE_MENTION=false

Discord bot can't hear me in voice channel

Verify your Discord user ID is in DISCORD_ALLOWED_USERS
Make sure you are not muted
The bot needs a SPEAKING event from Discord — speak within a few seconds of joining

Whisper returns garbage text

The hallucination filter catches most cases. If issues persist:

Use a quieter environment
Increase silence_threshold in config (higher = less sensitive)
Try a different STT model (switch from base to small or large-v3)

TTS produces no audio

Check the TTS provider API key and quota
Edge TTS (free, no key) is the default fallback
Check logs: tail -f ~/.hermes/logs/gateway.log

Telegram voice messages appear as files, not bubbles

Install ffmpeg:

sudo apt install ffmpeg

Or switch to a TTS provider that produces native Opus (OpenAI, ElevenLabs, Mistral).

Conclusion

Hermes Agent voice mode adds spoken conversation to your AI assistant. Installation is a single command (pip install "hermes-agent[voice]"), free providers such as Edge TTS and faster-whisper cover both synthesis and recognition, and the same mode works across the CLI, Telegram, and Discord.

The configuration flexibility — from STT/TTS provider selection to silence detection parameters — adapts voice mode to every context: accessibility, mobility, audio content production, or simply the convenience of oral conversation. To extend this setup, explore the multi-platform gateway that connects Hermes to all your messaging channels.
