VOICEVOX — Inside a Free AI Text-to-Speech Engine
Text → Phoneme → Pitch → Waveform: How a 3-Stage Pipeline Enables User Editing
VOICEVOX is the most widely used free AI text-to-speech software for YouTube commentary and gameplay videos. It generates character voices like Zundamon and Shikoku Metan.
But few people know how it actually works internally.
3-Stage Cascade Architecture
VOICEVOX isn't an end-to-end model like VITS. It runs 3 independent DNN models sequentially in a cascade pipeline.
Stage 1 — yukarin_s: Takes a phoneme ID array and predicts each phoneme's duration: how long each of the phonemes k, o, N, n, i, ch, i, w, a in "konnichiwa" lasts, in seconds.
Stage 2 — yukarin_sa: Takes vowel/consonant IDs + accent positions and predicts per-mora f0 (pitch). This is where Japanese pitch accent gets determined.
Stage 3 — decode: Takes phoneme onehot vectors (45 types, per-frame) and f0 to generate a 24kHz audio waveform. Acts as the vocoder.
The reason for this separation: stages 1-2 results are returned to users as JSON (AudioQuery), letting them manually adjust pitch and duration before running stage 3. This enables fine-grained control impossible with end-to-end models.
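The cascade above can be sketched with stub models. This is illustrative only — the function bodies are placeholders, and only the shapes and the frames × 256 = samples relationship follow the article; none of this is the real model code:

```python
# Illustrative 3-stage cascade with stub models. The point: stages 1-2
# produce plain arrays the user can edit before running stage 3.
import numpy as np

def yukarin_s(phoneme_ids: np.ndarray) -> np.ndarray:
    """Stage 1 stub: per-phoneme duration in seconds."""
    return np.full(len(phoneme_ids), 0.1, dtype=np.float32)

def yukarin_sa(mora_ids: np.ndarray) -> np.ndarray:
    """Stage 2 stub: per-mora f0 in Hz."""
    return np.full(len(mora_ids), 220.0, dtype=np.float32)

def decode(frames: int, f0: np.ndarray) -> np.ndarray:
    """Stage 3 stub: 24 kHz waveform, 256 samples per frame."""
    return np.zeros(frames * 256, dtype=np.float32)

phoneme_ids = np.array([1, 2, 3, 4], dtype=np.int64)
durations = yukarin_s(phoneme_ids)   # editable intermediate result
durations[0] *= 2.0                  # user tweak before stage 3
frames = int(np.sum(durations) * 24000 / 256)
wave = decode(frames, yukarin_sa(phoneme_ids))
```

Because the intermediate arrays are ordinary data, the user edit (doubling the first duration) changes the final waveform length — exactly the kind of control an end-to-end model can't expose.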
Text → Phoneme Conversion
OpenJTalk (a Japanese text-processing frontend built on the MeCab morphological analyzer) decomposes Japanese text into phonemes + accent information. VOICEVOX uses its own fork of pyopenjtalk.
Flow: Japanese text → MeCab morphological analysis → NJD → HTS label generation → regex parsing → AccentPhrase list
45 phoneme types total. pau (pause), cl (geminate っ), N (nasal ん), uppercase vowels (A, I, U, E, O) for devoiced vowels.
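A toy classifier over these special symbols shows how the inventory's conventions fit together. The grouping below is illustrative; the actual 45-entry phoneme-to-ID table is defined by VOICEVOX itself:

```python
# Illustrative: distinguishing the special symbols in the phoneme inventory.
# The actual ID mapping is VOICEVOX's own; this only shows the conventions.
SPECIAL = {"pau": "pause", "cl": "geminate", "N": "moraic nasal"}

def classify(p: str) -> str:
    if p in SPECIAL:
        return SPECIAL[p]
    if p in "AIUEO":        # uppercase vowel = devoiced vowel
        return "devoiced vowel"
    if p in "aiueo":
        return "vowel"
    return "consonant"

print([classify(p) for p in ["k", "o", "N", "cl", "I", "pau"]])
```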
3-Layer Software Structure
VOICEVOX Editor (Electron/TypeScript) → Engine (Python/FastAPI) → Core (Rust/ONNX Runtime)
Engine handles text preprocessing, API serving, and parameter editing. Actual DNN inference runs in Core via ONNX Runtime.
Engine → Core: What ctypes FFI Actually Does
Core is a shared library (.dll/.so/.dylib) written in Rust. Python Engine calls it directly via ctypes. ONNX Runtime is embedded inside the Core DLL — Python never touches ONNX sessions.
DLL loading order: load_runtime_lib() pre-loads the ONNX Runtime DLL first, then load_core() loads the Core DLL. Different binaries are selected by GPU type — CUDA GPU → DirectML GPU → CPU fallback.
Function signature binding: C functions in Core (yukarin_s_forward, decode_forward, etc.) get their argument types declared via ctypes: argtypes=(c_int, POINTER(c_long), POINTER(c_float)). All functions return c_bool; on failure, last_error_message() retrieves the error.
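The binding pattern can be shown with a stand-in: here libc's abs() plays the role of a Core function (loading the real Core DLL is out of scope for this sketch), but the argtypes/restype mechanics are identical:

```python
# Same ctypes mechanics as binding yukarin_s_forward, demonstrated on
# libc's abs() instead of the (unavailable) Core shared library.
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.abs.argtypes = (ctypes.c_int,)   # declare C argument types
libc.abs.restype = ctypes.c_int       # declare C return type
print(libc.abs(-7))
```

For Core, the same declarations would use POINTER(c_long) / POINTER(c_float) arguments and restype=c_bool, as described above.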
numpy ↔ C pointer conversion: This is the key part. numpy array data pointers are passed directly as C pointers: phoneme_list.ctypes.data_as(POINTER(c_long)) — no copy, the C function reads the same memory. Output buffers are pre-allocated as numpy arrays for the C function to write into directly. For decode: np.empty((length * 256,), dtype=np.float32) — frames × 256 = audio samples.
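The zero-copy idea can be demonstrated directly: anything that writes through a numpy array's raw data pointer mutates the array in place. Here ctypes.memmove stands in for a Core C function writing into a pre-allocated output buffer:

```python
# A C-side write through the raw pointer appears in the numpy array —
# no copy. memmove is a stand-in for Core writing its inference output.
import ctypes
import numpy as np

out = np.zeros(4, dtype=np.float32)                 # pre-allocated output buffer
src = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)

ctypes.memmove(out.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
               src.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
               out.nbytes)
print(out)  # the "C function" wrote straight into out's memory
```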
Thread safety: CoreAdapter wraps every inference call with threading.Lock(). ONNX Runtime sessions aren't thread-safe, so Core functions are only called inside with self.mutex:.
Pre/post silence padding: yukarin_s and yukarin_sa prepend/append silence (0) to input arrays — np.r_[0, phoneme_list, 0]. Results are trimmed on both ends after inference. Required by Core spec.
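The padding pattern, with a stub standing in for the actual inference call:

```python
# Pad with silence (phoneme ID 0) on both ends, run inference, trim both ends.
import numpy as np

phoneme_list = np.array([5, 12, 8], dtype=np.int64)
padded = np.r_[0, phoneme_list, 0]       # prepend/append silence per Core spec

durations = np.full(len(padded), 0.1, dtype=np.float32)  # stub for yukarin_s
trimmed = durations[1:-1]                # drop the two silence slots
```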
Inside Core (Rust)
Rust C ABI functions (#[unsafe(no_mangle)] pub extern "C" fn yukarin_s_forward(...)) receive C pointers, convert them to Rust arrays via unsafe { std::slice::from_raw_parts } or ndarray::ArrayView::from_shape_ptr, run ONNX inference, and copy results back to C pointers.
ONNX Runtime integration uses the ort crate (pykeio/ort). libloading::Library::new() dynamically loads the ONNX Runtime SO/DLL. GPU selection registers CUDAExecutionProvider or DirectMLExecutionProvider on the session builder. Models are stored as .onnx files inside .vvm archives (ZIP).
GPU Acceleration
ONNX Runtime handles GPU inference. ~330x faster on GPU (A4000) vs CPU. Supports CUDA and DirectML backends. VOICEVOX uses a custom-built ONNX Runtime.
Why It Matters for YouTube Creators
It's free. Commercial use is allowed (credit required). Connects directly to YMM4 via HTTP API. Emotion styles per character (sweet, tsundere, whisper, etc.) are switchable. Solves the monetization uncertainty of legacy Yukkuri voices.
Running as a Docker HTTP Server
VOICEVOX Engine is a FastAPI-based HTTP server. You can spin it up with Docker without any local installation.
```shell
docker pull voicevox/voicevox_engine:cpu-ubuntu20.04-latest
docker run --rm -p 50021:50021 voicevox/voicevox_engine:cpu-ubuntu20.04-latest
```
GPU version available too — use the nvidia-ubuntu20.04-latest tag with --gpus all.
Once running, http://localhost:50021/docs shows the full Swagger UI. Two core APIs:
- POST /audio_query?text=こんにちは&speaker=3 — converts text to AudioQuery (JSON)
- POST /synthesis?speaker=3 — takes an AudioQuery and returns WAV audio
Two piped curl calls (audio_query into synthesis) are enough to generate speech. Handy for automating TTS from scripts or server-side without the GUI editor. YMM4 also calls this HTTP API internally.
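A minimal Python client for these two endpoints can be sketched as follows. The HTTP transport is injected as a callable so the flow runs without a live server (with a real server you'd pass a thin wrapper over e.g. requests.post); speedScale is one of the AudioQuery's editable fields:

```python
# Two-step Engine client: /audio_query builds the editable JSON,
# /synthesis turns the (possibly edited) JSON into WAV bytes.
import json
from urllib.parse import urlencode

BASE = "http://localhost:50021"

def synthesize(text: str, speaker: int, post) -> bytes:
    """post(url, body) -> response bytes; returns WAV audio."""
    raw = post(f"{BASE}/audio_query?{urlencode({'text': text, 'speaker': speaker})}", b"")
    query = json.loads(raw)          # editable AudioQuery
    query["speedScale"] = 1.1        # e.g. speak 10% faster before stage 3
    return post(f"{BASE}/synthesis?{urlencode({'speaker': speaker})}",
                json.dumps(query).encode())
```

Injecting the transport also makes the flow unit-testable with a fake post function, which is how the sketch above was checked.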
3-Stage Inference Pipeline
| Stage | Model | Input | Output |
|---|---|---|---|
| 1 | yukarin_s | Phoneme ID array + style_id | Per-phoneme duration (sec) |
| 2 | yukarin_sa | Vowel/consonant ID + accent position + style_id | Per-mora f0 (pitch, Hz) |
| 3 | decode | Phoneme onehot (45 types, per-frame) + f0 | 24kHz audio waveform |
Engine → Core: ctypes FFI Details
Python Engine calls Rust Core's C ABI functions directly via ctypes. ONNX Runtime is encapsulated inside the Core DLL — Python never touches sessions.
Key Patterns
- ✓ Zero-copy — C functions read/write numpy array memory directly via array.ctypes.data_as(POINTER(c_long))
- ✓ threading.Lock() — serializes all Core calls; ONNX Runtime sessions aren't thread-safe
- ✓ Lazy loading — models are loaded per style_id on first use, reducing startup time
- ✓ Error propagation — a C function returning false triggers last_error_message(), converting Rust errors to Python exceptions
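The error-propagation path can be sketched like this. FakeCore is a stand-in for the real ctypes CDLL handle; only the returns-false-then-fetch-message pattern follows the description above:

```python
# Core call returns c_bool False -> fetch last_error_message() -> raise.
class FakeCore:
    """Stand-in for the ctypes handle to the Core shared library."""
    def yukarin_s_forward(self, *args):
        return False                        # simulate inference failure
    def last_error_message(self):
        return b"invalid phoneme id"

def checked_call(core, name, *args):
    """Call a Core function; convert a false return into a Python exception."""
    if not getattr(core, name)(*args):
        raise RuntimeError(core.last_error_message().decode())

try:
    checked_call(FakeCore(), "yukarin_s_forward")
except RuntimeError as e:
    print(e)
```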
Inside Core (Rust → ONNX Runtime)
Rust C ABI functions receive C pointers → convert via unsafe { slice::from_raw_parts } to ndarray → run ONNX session → as_standard_layout() for C-contiguous guarantee → copy back to output pointer. GPU selection registers CUDAExecutionProvider or DirectMLExecutionProvider on the session builder. Models stored as .onnx inside .vvm ZIP archives.
Why Cascade Instead of End-to-End?
End-to-end models like VITS go from text straight to audio. Convenient, but you can't touch intermediate results. "I want to raise the pitch on this word" is impossible.
VOICEVOX returns stages 1-2 results as AudioQuery (JSON). Users draw pitch curves in the editor UI, adjust phoneme durations, then run only stage 3 (vocoder). This "editable intermediate representation" is VOICEVOX's core design philosophy.
Developer Background
Hiho (Hiroshiba) — ML Engineer at Dwango Media Village. Creator of the "yukarin" AI voice conversion library. First release: August 2021. Goal: connect people who want to create free character voices with people who want to use them.
Step-by-Step
1. Input Japanese text → OpenJTalk performs morphological analysis and generates AccentPhrases
2. yukarin_s model predicts per-phoneme duration (length)
3. yukarin_sa model predicts per-mora f0 (pitch)
4. AudioQuery (JSON) is returned to the user — pitch, speed, and intonation are editable
5. decode model (vocoder) generates a 24kHz WAV waveform from the edited parameters
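Concretely, the AudioQuery handed back at step 4 is plain JSON that any script can edit before calling /synthesis. The sketch below is trimmed down — the real schema has more fields (volumeScale, the contents of accent_phrases, etc.):

```python
# A trimmed-down AudioQuery: edit it, then POST to /synthesis to run
# only stage 3 (the vocoder) on the adjusted parameters.
query = {
    "accent_phrases": [],        # per-mora pitch/length values live in here
    "speedScale": 1.0,
    "pitchScale": 0.0,
    "intonationScale": 1.0,
    "outputSamplingRate": 24000,
}
query["pitchScale"] = 0.05       # raise overall pitch slightly
query["speedScale"] = 1.2        # speak 20% faster
```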
Pros
- ✓ Completely free + commercial use allowed (credit required)
- ✓ 25+ characters × 8 emotion styles — wide voice selection
- ✓ Manual fine-tuning of pitch/speed/intonation via AudioQuery
- ✓ ~330x faster with GPU vs CPU (ONNX Runtime)
Cons
- ✗ Japanese only — no English/Korean speech synthesis
- ✗ Slow synthesis on CPU without GPU (CUDA)
- ✗ Multi-GPU environment (Optimus, etc.) issues reported with DirectML version
- ✗ Usage terms vary per character — must check in advance