jamiepine / voicebox

The open-source voice synthesis studio

Tags: Voicebox, voice cloning, qwen3-tts
24.5k · 3k · 5 visits · 1 · Updated 2026-03-31 12:20

Voicebox

The open-source voice synthesis studio.
Clone voices. Generate speech. Apply effects. Build voice-powered apps.
All running locally on your machine.

voicebox.sh · Docs · Download · Features · API


[Screenshots: Voicebox app. A demo video is available on voicebox.sh.]


What is Voicebox?

Voicebox is a local-first voice cloning studio — a free and open-source alternative to ElevenLabs. Clone voices from a few seconds of audio, generate speech in 23 languages across 5 TTS engines, apply post-processing effects, and compose multi-voice projects with a timeline editor.

  • Complete privacy — models and voice data stay on your machine
  • 5 TTS engines — Qwen3-TTS, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, and HumeAI TADA
  • 23 languages — from English to Arabic, Japanese, Hindi, Swahili, and more
  • Post-processing effects — pitch shift, reverb, delay, chorus, compression, and filters
  • Expressive speech — paralinguistic tags like [laugh], [sigh], [gasp] via Chatterbox Turbo
  • Unlimited length — auto-chunking with crossfade for scripts, articles, and chapters
  • Stories editor — multi-track timeline for conversations, podcasts, and narratives
  • API-first — REST API for integrating voice synthesis into your own projects
  • Native performance — built with Tauri (Rust), not Electron
  • Runs everywhere — macOS (MLX/Metal), Windows (CUDA), Linux, AMD ROCm, Intel Arc, Docker

Download

| Platform | Download |
|---|---|
| macOS (Apple Silicon) | Download DMG |
| macOS (Intel) | Download DMG |
| Windows | Download MSI |
| Docker | docker compose up |

View all binaries →

Linux — Pre-built binaries are not yet available. See voicebox.sh/linux-install for build-from-source instructions.


Features

Multi-Engine Voice Cloning

Five TTS engines with different strengths, switchable per-generation:

| Engine | Languages | Strengths |
|---|---|---|
| Qwen3-TTS (0.6B / 1.7B) | 10 | High-quality multilingual cloning, delivery instructions ("speak slowly", "whisper") |
| LuxTTS | English | Lightweight (~1GB VRAM), 48kHz output, 150x realtime on CPU |
| Chatterbox Multilingual | 23 | Broadest language coverage: Arabic, Danish, Finnish, Greek, Hebrew, Hindi, Malay, Norwegian, Polish, Swahili, Swedish, Turkish and more |
| Chatterbox Turbo | English | Fast 350M model with paralinguistic emotion/sound tags |
| TADA (1B / 3B) | 10 | HumeAI speech-language model; 700s+ coherent audio, text-acoustic dual alignment |

Emotions & Paralinguistic Tags

Type / in the text input to insert expressive tags that the model synthesizes inline with speech (Chatterbox Turbo):

[laugh] [chuckle] [gasp] [cough] [sigh] [groan] [sniff] [shush] [clear throat]

Post-Processing Effects

8 audio effects powered by Spotify's pedalboard library. Apply after generation, preview in real time, build reusable presets.

| Effect | Description |
|---|---|
| Pitch Shift | Up or down by up to 12 semitones |
| Reverb | Configurable room size, damping, wet/dry mix |
| Delay | Echo with adjustable time, feedback, and mix |
| Chorus / Flanger | Modulated delay for metallic or lush textures |
| Compressor | Dynamic range compression |
| Gain | Volume adjustment (-40 to +40 dB) |
| High-Pass Filter | Remove low frequencies |
| Low-Pass Filter | Remove high frequencies |

Ships with 4 built-in presets (Robotic, Radio, Echo Chamber, Deep Voice) and supports custom presets. Effects can be assigned per-profile as defaults.
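
For intuition about the ranges in the table above, the conversions behind gain and pitch shift are standard audio arithmetic, not anything specific to Voicebox. The sketch below is illustrative only: the function names are hypothetical and are not part of Voicebox or pedalboard.

```python
def db_to_linear(gain_db: float) -> float:
    """Convert a gain in decibels to a linear amplitude multiplier."""
    return 10 ** (gain_db / 20)


def semitones_to_ratio(semitones: float) -> float:
    """Convert a pitch shift in semitones to a frequency ratio (equal temperament)."""
    return 2 ** (semitones / 12)


# The extremes of the documented ranges.
max_gain = db_to_linear(40)          # amplitude multiplier at +40 dB
min_gain = db_to_linear(-40)         # amplitude multiplier at -40 dB
octave_up = semitones_to_ratio(12)   # frequency ratio at +12 semitones
octave_down = semitones_to_ratio(-12)
```

In other words, the ±40 dB gain range spans amplitude multipliers from roughly 0.01x to 100x, and the ±12 semitone pitch range corresponds to one octave down (half the frequency) through one octave up (double the frequency).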

Unlimited Generation Length

Text is automatically split at sentence boundaries and each chunk is generated independently, then crossfaded together. Works with all engines.

  • Configurable auto-chunking limit (100–5,000 chars)
  • Crossfade slider (0–200ms) for smooth transitions
  • Max text length: 50,000 characters
  • Smart splitting respects abbreviations, CJK punctuation, and [tags]
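
The chunk-then-crossfade idea can be sketched in a few lines of Python. This is a simplified illustration, not Voicebox's implementation: the function names are hypothetical, the splitter is naive (it ignores abbreviations and gives `[tags]` no special handling, unlike the smart splitting described above), and audio is represented as plain lists of float samples.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive split after ASCII sentence punctuation (needs whitespace) or CJK punctuation."""
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])\s*", text.strip())
    return [p for p in parts if p]


def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    Sentences are rejoined with a single space. A single sentence longer
    than max_chars is kept as its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Join two sample lists, linearly blending the last/first `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    mixed = []
    for i in range(overlap):
        t = (i + 1) / (overlap + 1)  # fade weight, strictly between 0 and 1
        mixed.append(a[len(a) - overlap + i] * (1 - t) + b[i] * t)
    return a[: len(a) - overlap] + mixed + b[overlap:]


chunks = chunk_text("One. Two. Three.", max_chars=9)
joined = crossfade([1.0] * 4, [0.0] * 4, overlap=2)
```

To relate the crossfade slider to `overlap`, multiply the duration by the sample rate: at an assumed 24,000 Hz output, a 200 ms crossfade would overlap 4,800 samples.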

Generation Versions

Every generation supports multiple versions with provenance tracking:

  • Original — clean TTS output, always preserved
  • Effects versions — apply different effects chains from any source version
  • Takes — regenerate with a new seed for variation
  • Source tracking — each version records its lineage
  • Favorites — star generations for quick access
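
One way to picture source tracking is a record per version that points at the version it was derived from. The sketch below is an assumption about how such lineage could be modeled; the `Version` class, its fields, and the `lineage` helper are hypothetical and do not reflect Voicebox's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    id: str
    kind: str                        # e.g. "original", "effects", or "take"
    source_id: Optional[str] = None  # version this one was derived from, if any


def lineage(versions: dict[str, Version], version_id: str) -> list[str]:
    """Walk from a version back to its root, returning the chain of ids."""
    chain: list[str] = []
    current = versions.get(version_id)
    while current is not None and current.id not in chain:  # guard against cycles
        chain.append(current.id)
        current = versions.get(current.source_id) if current.source_id else None
    return chain


versions = {
    "v1": Version("v1", "original"),
    "v2": Version("v2", "effects", source_id="v1"),
    "v3": Version("v3", "effects", source_id="v2"),
}
chain = lineage(versions, "v3")
```

Here `chain` lists the ids from the newest version back to the original, which is what lets an effects version be traced to the clean TTS output it came from.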

Async Generation Queue

Generation is non-blocking. Submit and immediately start typing the next one.

  • Serial execution queue prevents GPU contention
  • Real-time SSE status streaming
  • Failed generations can be retried
  • Stale generations from crashes auto-recover on startup
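
The combination of non-blocking submission and serial execution can be illustrated with a single-worker `asyncio` queue. This is a minimal sketch of the pattern, not Voicebox's code: the worker, the job strings, and the short sleep standing in for GPU-bound synthesis are all placeholders.

```python
import asyncio


async def worker(queue: asyncio.Queue, results: list[str]) -> None:
    """Process jobs one at a time so only one generation uses the GPU."""
    while True:
        job = await queue.get()
        try:
            await asyncio.sleep(0.01)  # stand-in for the actual synthesis call
            results.append(f"done:{job}")
        finally:
            queue.task_done()


async def main() -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    results: list[str] = []
    task = asyncio.create_task(worker(queue, results))

    # Submitting returns immediately; the caller is free to queue more work.
    for text in ["first", "second", "third"]:
        queue.put_nowait(text)

    await queue.join()  # wait until every queued job has been processed
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)  # let the worker shut down
    return results


results = asyncio.run(main())
```

Because there is exactly one worker, jobs complete in submission order and never overlap, which is the property that avoids GPU contention.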

Voice Profile Management

  • Create profiles from audio files or record directly in-app
  • Import/export profiles to share or back up
  • Multi-sample support for higher quality cloning
  • Per-profile default effects chains
  • Organize with descriptions and language tags

Stories Editor

Multi-voice timeline editor for conversations, podcasts, and narratives.

  • Multi-track composition with drag-and-drop
  • Inline audio trimming and splitting
  • Auto-playback with synchronized playhead
  • Version pinning per track clip

Recording & Transcription

  • In-app recording with waveform visualization
  • System audio capture (macOS and Windows)
  • Automatic transcription powered by Whisper (including Whisper Turbo)
  • Export recordings in multiple formats

Model Management

  • Per-model unload to free GPU memory without deleting downloads
  • Custom models directory via VOICEBOX_MODELS_DIR
  • Model folder migration with progress tracking
  • Download cancel/clear UI
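
As an illustration of how an environment-variable override such as `VOICEBOX_MODELS_DIR` is typically resolved, consider the sketch below. Only the variable name comes from this README; the `resolve_models_dir` helper and the fallback path are hypothetical, and the real default location may differ.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_models_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the models directory, preferring VOICEBOX_MODELS_DIR when it is set."""
    env = os.environ if env is None else env
    override = env.get("VOICEBOX_MODELS_DIR")
    if override:
        return Path(override).expanduser()
    # Hypothetical fallback, used here only so the example is complete.
    return Path.home() / ".voicebox" / "models"


custom = resolve_models_dir({"VOICEBOX_MODELS_DIR": "/tmp/voicebox-models"})
default = resolve_models_dir({})
```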

GPU Support

| Platform | Backend | Notes |
|---|---|---|
| macOS (Apple Silicon) | MLX (Metal) | 4-5x faster via Neural Engine |
| Windows / Linux (NVIDIA) | PyTorch (CUDA) | Auto-downloads CUDA binary from within the app |
| Linux (AMD) | PyTorch (ROCm) | Auto-configures HSA_OVERRIDE_GFX_VERSION |
| Windows (any GPU) | DirectML | Universal Windows GPU support |
| Intel Arc | IPEX/XPU | Intel discrete GPU acceleration |
| Any | CPU | Works everywhere, just slower |

API

Voicebox exposes a full REST API for integrating voice synthesis into your own apps.

    # Generate speech
    curl -X POST http://localhost:17493/generate \
      -H "Content-Type: application/json" \
      -d '{"text": "Hello world", "profile_id": "abc123", "language": "en"}'

    # List voice profiles
    curl http://localhost:17493/profiles

    # Create a profile
    curl -X POST http://localhost:17493/profiles \
      -H "Content-Type: application/json" \
      -d '{"name": "My Voice", "language": "en"}'

Use cases: game dialogue, podcast production, accessibility tools, voice assistants, content automation.

Full API documentation available at http://localhost:17493/docs.
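
For callers who prefer Python to curl, the same `/generate` request can be built with the standard library. This sketch mirrors only the endpoint, port, and JSON fields shown in the curl example; the helper name is hypothetical, and since the response format is not described here, consult the `/docs` page of a running instance before parsing the reply.

```python
import json
import urllib.request

BASE_URL = "http://localhost:17493"  # port taken from the curl examples


def build_generate_request(
    text: str, profile_id: str, language: str = "en"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /generate endpoint."""
    payload = {"text": text, "profile_id": profile_id, "language": language}
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("Hello world", "abc123")

# Sending requires a running Voicebox instance:
# with urllib.request.urlopen(request) as response:
#     body = response.read()
```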


Tech Stack

| Layer | Technology |
|---|---|
| Desktop App | Tauri (Rust) |
| Frontend | React, TypeScript, Tailwind CSS |
| State | Zustand, React Query |
| Backend | FastAPI (Python) |
| TTS Engines | Qwen3-TTS, LuxTTS, Chatterbox, Chatterbox Turbo, TADA |
| Effects | Pedalboard (Spotify) |
| Transcription | Whisper / Whisper Turbo (PyTorch or MLX) |
| Inference | MLX (Apple Silicon) / PyTorch (CUDA/ROCm/XPU/CPU) |
| Database | SQLite |
| Audio | WaveSurfer.js, librosa |

Roadmap

| Feature | Description |
|---|---|
| Real-time Streaming | Stream audio as it generates, word by word |
| Voice Design | Create new voices from text descriptions |
| More Models | XTTS, Bark, and other open-source voice models |
| Plugin Architecture | Extend with custom models and effects |
| Mobile Companion | Control Voicebox from your phone |

Development

See CONTRIBUTING.md for detailed setup and contribution guidelines.

Quick Start

    git clone https://github.com/jamiepine/voicebox.git
    cd voicebox

    just setup   # creates Python venv, installs all deps
    just dev     # starts backend + desktop app

Install just: brew install just or cargo install just. Run just --list to see all commands.

Prerequisites: Bun, Rust, Python 3.11+, Tauri Prerequisites, and Xcode on macOS.

Building Locally

    just build          # Build CPU server binary + Tauri app
    just build-local    # (Windows) Build CPU + CUDA server binaries + Tauri app

Adding New Voice Models

The multi-engine architecture makes adding new TTS engines straightforward. A step-by-step guide covers the full process: dependency research, backend protocol implementation, frontend wiring, and PyInstaller bundling.

The guide is optimized for AI coding agents. An agent skill can pick up a model name and handle the entire integration autonomously — you just test the build locally.

Project Structure

    voicebox/
    ├── app/              # Shared React frontend
    ├── tauri/            # Desktop app (Tauri + Rust)
    ├── web/              # Web deployment
    ├── backend/          # Python FastAPI server
    ├── landing/          # Marketing website
    └── scripts/          # Build & release scripts

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

  1. Fork the repo
  2. Create a feature branch
  3. Make your changes
  4. Submit a PR

Security

Found a security vulnerability? Please report it responsibly. See SECURITY.md for details.


License

MIT License — see LICENSE for details.

