Open Source Voice Interface
Kiwi Voice
ML wake word detection, speaker identification, voice‑gated security, 5 TTS engines, 15 languages, and a real‑time web dashboard — for your own AI stack.
How it works
Kiwi Voice turns your OpenClaw agent into a hands-free assistant. It captures audio from your microphone (or directly from the browser), detects the wake word, transcribes speech locally, identifies who is speaking, enforces security policies, sends the command to any LLM through OpenClaw's WebSocket gateway, and speaks the response back — all in a continuous loop.
You: "Kiwi, turn on the lights in the bedroom"
Kiwi: [identifies speaker as Owner → full access]
[sends to OpenClaw → routes to Home Assistant]
"Done, the bedroom lights are on."
Think Alexa or Siri, but self-hosted, privacy-first, and plugged into your own AI stack.
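The loop above can be sketched in a few lines. Every function name here is a hypothetical stand-in for illustration, not the actual Kiwi API:

```python
# Minimal sketch of one pass through the Kiwi Voice pipeline.
# All callables are hypothetical stand-ins, not the real Kiwi API.

def handle_utterance(audio, transcribe, identify, is_allowed, ask_llm, speak):
    """Run one wake-word-to-speech cycle and return the spoken reply."""
    text = transcribe(audio)           # local STT (e.g. Faster Whisper)
    speaker = identify(audio)          # voiceprint -> Owner/Friend/Guest/Blocked
    if not is_allowed(speaker, text):  # priority gate + security check
        return None
    reply = ask_llm(text)              # routed through the OpenClaw gateway
    speak(reply)                       # streaming TTS back to the speaker
    return reply
```

In the real system each stage runs asynchronously so audio capture and playback can overlap, but the data flow is the same.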
Features
- **Wake Word Detection:** Text fuzzy matching or OpenWakeWord ML (ONNX model, ~80ms latency, ~2% CPU). Built-in models or train your own.
- **Speaker Identification:** Voiceprint recognition via pyannote embeddings. Priority hierarchy: Owner → Friend → Guest → Blocked.
- **Two-Layer Security:** Pre-LLM dangerous-command detector plus post-LLM exec approval. Telegram notifications for non-owner actions.
- **5 TTS Providers:** ElevenLabs, Kokoro ONNX, Piper, Qwen3-TTS. Streaming sentence-aware chunking with barge-in support.
- **Web Dashboard & API:** Glassmorphism dark dashboard with live status, event log, personalities, speaker management, and browser mic.
- **Home Assistant:** Bidirectional integration. Control Kiwi from the HA dashboard, and control your smart home by voice through Kiwi.
- **15 Languages:** Full i18n with YAML locales. All strings, voice commands, wake word variants, and security patterns are per-language.
- **Personality System:** 5 built-in "souls". Switch by voice, API, or dashboard. NSFW routes to a separate, isolated LLM session.
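The text fuzzy-matching mode for wake words can be approximated with the Python standard library. The variants and threshold below are illustrative, not Kiwi's actual values:

```python
from difflib import SequenceMatcher

# Illustrative wake-word variants; the real list is configurable per language.
WAKE_VARIANTS = ["kiwi", "kiwee", "keewee"]

def matches_wake_word(word: str, threshold: float = 0.8) -> bool:
    """Return True if `word` is close enough to any wake-word variant.

    SequenceMatcher.ratio() gives a 0..1 similarity score, which lets
    the wake word survive minor STT misspellings.
    """
    word = word.lower().strip()
    return any(
        SequenceMatcher(None, word, variant).ratio() >= threshold
        for variant in WAKE_VARIANTS
    )
```

This tolerates small transcription errors at essentially zero cost; the OpenWakeWord ML path runs on raw audio instead and catches the wake word before transcription happens.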
Quick Start
```bash
git clone https://github.com/ekleziast/kiwi-voice.git
cd kiwi-voice
pip install -r requirements.txt
cp .env.example .env
python -m kiwi
```
Open http://localhost:7789 for the web dashboard.
Architecture
```
Mic (24kHz) / Browser WebSocket → Audio Pipeline (Silero VAD + energy detection)
  → Wake Word (OpenWakeWord ML or text fuzzy match)
  → Faster Whisper STT (or MLX Whisper on Apple Silicon)
  → Speaker ID (pyannote embeddings) → Priority Gate (Owner/Friend/Guest/Blocked)
  → Voice Security (dangerous command regex → Telegram approval)
  → OpenClaw Gateway (WebSocket v3)
  → LLM response stream (delta → sentence chunking)
  → Streaming TTS (Kokoro/Piper/Qwen3/ElevenLabs) → Speaker output + browser playback
  → Barge-in detection → back to listening
```
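The "delta → sentence chunking" step above can be sketched as a small buffer that flushes whenever a sentence boundary arrives, so TTS can start speaking before the LLM finishes its reply. This is a simplified illustration, not Kiwi's actual implementation:

```python
import re

# A sentence ends at ., !, or ? followed by whitespace (simplified rule).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def chunk_deltas(deltas):
    """Accumulate streamed LLM deltas and yield complete sentences
    as soon as they are available."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        parts = SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer  # flush the final partial sentence
```

Feeding this generator the token stream lets the TTS engine synthesize sentence one while sentence two is still being generated, which is what makes barge-in and low perceived latency possible.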