VOICEVOX — Inside a Free AI Text-to-Speech Engine
Text → Phoneme → Pitch → Waveform: How a 3-Stage Pipeline Enables User Editing
VOICEVOX is the most widely used free AI text-to-speech software for YouTube commentary and gameplay videos. It generates character voices like Zundamon and Shikoku Metan.
But few people know how it actually works internally.
3-Stage Cascade Architecture
VOICEVOX isn't an end-to-end model like VITS. It runs 3 independent DNN models sequentially in a cascade pipeline.
Stage 1 — yukarin_s: Takes a phoneme ID array and predicts each phoneme's duration: how long each of the phonemes k, o, N, n, i, ch, i, w, a in "konnichiwa" lasts, in seconds.
Stage 2 — yukarin_sa: Takes vowel/consonant IDs + accent positions and predicts per-mora f0 (pitch). This is where Japanese pitch accent gets determined.
Stage 3 — decode: Takes phoneme onehot vectors (45 types, per-frame) and f0 to generate a 24kHz audio waveform. Acts as the vocoder.
The reason for this separation: stages 1-2 results are returned to users as JSON (AudioQuery), letting them manually adjust pitch and duration before running stage 3. This enables fine-grained control impossible with end-to-end models.
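The cascade above can be sketched with stub models. This is illustrative only — the function bodies are placeholders, and only the shapes and the frames × 256 = samples relationship follow the article; none of this is the real model code:

```python
# Illustrative 3-stage cascade with stub models. The point: stages 1-2
# produce plain arrays the user can edit before running stage 3.
import numpy as np

def yukarin_s(phoneme_ids: np.ndarray) -> np.ndarray:
    """Stage 1 stub: per-phoneme duration in seconds."""
    return np.full(len(phoneme_ids), 0.1, dtype=np.float32)

def yukarin_sa(mora_ids: np.ndarray) -> np.ndarray:
    """Stage 2 stub: per-mora f0 in Hz."""
    return np.full(len(mora_ids), 220.0, dtype=np.float32)

def decode(frames: int, f0: np.ndarray) -> np.ndarray:
    """Stage 3 stub: 24 kHz waveform, 256 samples per frame."""
    return np.zeros(frames * 256, dtype=np.float32)

phoneme_ids = np.array([1, 2, 3, 4], dtype=np.int64)
durations = yukarin_s(phoneme_ids)   # editable intermediate result
durations[0] *= 2.0                  # user tweak before stage 3
frames = int(np.sum(durations) * 24000 / 256)
wave = decode(frames, yukarin_sa(phoneme_ids))
```

Because the intermediate arrays are ordinary data, the user edit (doubling the first duration) changes the final waveform length — exactly the kind of control an end-to-end model can't expose.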
Text → Phoneme Conversion
OpenJTalk (a Japanese text-processing frontend built on the MeCab morphological analyzer) decomposes Japanese text into phonemes + accent information. VOICEVOX uses its own fork of pyopenjtalk.
Flow: Japanese text → MeCab morphological analysis → NJD → HTS label generation → regex parsing → AccentPhrase list
45 phoneme types total. pau (pause), cl (geminate っ), N (nasal ん), uppercase vowels (A, I, U, E, O) for devoiced vowels.
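A toy classifier over these special symbols shows how the inventory's conventions fit together. The grouping below is illustrative; the actual 45-entry phoneme-to-ID table is defined by VOICEVOX itself:

```python
# Illustrative: distinguishing the special symbols in the phoneme inventory.
# The actual ID mapping is VOICEVOX's own; this only shows the conventions.
SPECIAL = {"pau": "pause", "cl": "geminate", "N": "moraic nasal"}

def classify(p: str) -> str:
    if p in SPECIAL:
        return SPECIAL[p]
    if p in "AIUEO":        # uppercase vowel = devoiced vowel
        return "devoiced vowel"
    if p in "aiueo":
        return "vowel"
    return "consonant"

print([classify(p) for p in ["k", "o", "N", "cl", "I", "pau"]])
```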
3-Layer Software Structure
VOICEVOX Editor (Electron/TypeScript) → Engine (Python/FastAPI) → Core (Rust/ONNX Runtime)
Engine handles text preprocessing, API serving, and parameter editing. Actual DNN inference runs in Core via ONNX Runtime.
Engine → Core: What ctypes FFI Actually Does
Core is a shared library (.dll/.so/.dylib) written in Rust. Python Engine calls it directly via ctypes. ONNX Runtime is embedded inside the Core DLL — Python never touches ONNX sessions.
DLL loading order: load_runtime_lib() pre-loads the ONNX Runtime DLL first, then load_core() loads the Core DLL. Different binaries are selected by GPU type — CUDA GPU → DirectML GPU → CPU fallback.
Function signature binding: C functions in Core (yukarin_s_forward, decode_forward, etc.) get their argument types declared via ctypes: argtypes=(c_int, POINTER(c_long), POINTER(c_float)). All functions return c_bool; on failure, last_error_message() retrieves the error.
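The binding pattern can be shown with a stand-in: here libc's abs() plays the role of a Core function (loading the real Core DLL is out of scope for this sketch), but the argtypes/restype mechanics are identical:

```python
# Same ctypes mechanics as binding yukarin_s_forward, demonstrated on
# libc's abs() instead of the (unavailable) Core shared library.
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.abs.argtypes = (ctypes.c_int,)   # declare C argument types
libc.abs.restype = ctypes.c_int       # declare C return type
print(libc.abs(-7))
```

For Core, the same declarations would use POINTER(c_long) / POINTER(c_float) arguments and restype=c_bool, as described above.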
numpy ↔ C pointer conversion: This is the key part. numpy array data pointers are passed directly as C pointers: phoneme_list.ctypes.data_as(POINTER(c_long)) — no copy, the C function reads the same memory. Output buffers are pre-allocated as numpy arrays for the C function to write into directly. For decode: np.empty((length * 256,), dtype=np.float32) — frames × 256 = audio samples.
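The zero-copy idea can be demonstrated directly: anything that writes through a numpy array's raw data pointer mutates the array in place. Here ctypes.memmove stands in for a Core C function writing into a pre-allocated output buffer:

```python
# A C-side write through the raw pointer appears in the numpy array —
# no copy. memmove is a stand-in for Core writing its inference output.
import ctypes
import numpy as np

out = np.zeros(4, dtype=np.float32)                 # pre-allocated output buffer
src = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)

ctypes.memmove(out.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
               src.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
               out.nbytes)
print(out)  # the "C function" wrote straight into out's memory
```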
Thread safety: CoreAdapter wraps every inference call with threading.Lock(). ONNX Runtime sessions aren't thread-safe, so Core functions are only called inside with self.mutex:.
Pre/post silence padding: yukarin_s and yukarin_sa prepend/append silence (0) to input arrays — np.r_[0, phoneme_list, 0]. Results are trimmed on both ends after inference. Required by Core spec.
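The padding pattern, with a stub standing in for the actual inference call:

```python
# Pad with silence (phoneme ID 0) on both ends, run inference, trim both ends.
import numpy as np

phoneme_list = np.array([5, 12, 8], dtype=np.int64)
padded = np.r_[0, phoneme_list, 0]       # prepend/append silence per Core spec

durations = np.full(len(padded), 0.1, dtype=np.float32)  # stub for yukarin_s
trimmed = durations[1:-1]                # drop the two silence slots
```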
Inside Core (Rust)
Rust C ABI functions (#[unsafe(no_mangle)] pub extern "C" fn yukarin_s_forward(...)) receive C pointers, convert them to Rust arrays via unsafe { std::slice::from_raw_parts } or ndarray::ArrayView::from_shape_ptr, run ONNX inference, and copy results back to C pointers.
ONNX Runtime integration uses the ort crate (pykeio/ort). libloading::Library::new() dynamically loads the ONNX Runtime SO/DLL. GPU selection registers CUDAExecutionProvider or DirectMLExecutionProvider on the session builder. Models are stored as .onnx files inside .vvm archives (ZIP).
GPU Acceleration
ONNX Runtime handles GPU inference. ~330x faster on GPU (A4000) vs CPU. Supports CUDA and DirectML backends. VOICEVOX uses a custom-built ONNX Runtime.
Why It Matters for YouTube Creators
It's free. Commercial use is allowed (credit required). Connects directly to YMM4 via HTTP API. Emotion styles per character (sweet, tsundere, whisper, etc.) are switchable. Solves the monetization uncertainty of legacy Yukkuri voices.
Running as a Docker HTTP Server
VOICEVOX Engine is a FastAPI-based HTTP server. You can spin it up with Docker without any local installation.
```shell
docker pull voicevox/voicevox_engine:cpu-ubuntu20.04-latest
docker run --rm -p 50021:50021 voicevox/voicevox_engine:cpu-ubuntu20.04-latest
```
GPU version available too — use the nvidia-ubuntu20.04-latest tag with --gpus all.
Once running, http://localhost:50021/docs shows the full Swagger UI. Two core APIs:
- POST /audio_query?text=こんにちは&speaker=3 — converts text to AudioQuery (JSON)
- POST /synthesis?speaker=3 — takes an AudioQuery and returns WAV audio
Two piped curl calls (audio_query into synthesis) are enough to generate speech. Handy for automating TTS from scripts or server-side without the GUI editor. YMM4 also calls this HTTP API internally.
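A minimal Python client for these two endpoints can be sketched as follows. The HTTP transport is injected as a callable so the flow runs without a live server (with a real server you'd pass a thin wrapper over e.g. requests.post); speedScale is one of the AudioQuery's editable fields:

```python
# Two-step Engine client: /audio_query builds the editable JSON,
# /synthesis turns the (possibly edited) JSON into WAV bytes.
import json
from urllib.parse import urlencode

BASE = "http://localhost:50021"

def synthesize(text: str, speaker: int, post) -> bytes:
    """post(url, body) -> response bytes; returns WAV audio."""
    raw = post(f"{BASE}/audio_query?{urlencode({'text': text, 'speaker': speaker})}", b"")
    query = json.loads(raw)          # editable AudioQuery
    query["speedScale"] = 1.1        # e.g. speak 10% faster before stage 3
    return post(f"{BASE}/synthesis?{urlencode({'speaker': speaker})}",
                json.dumps(query).encode())
```

Injecting the transport also makes the flow unit-testable with a fake post function, which is how the sketch above was checked.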
3-Stage Inference Pipeline
| Stage | Model | Input | Output |
|---|---|---|---|
| 1 | yukarin_s | Phoneme ID array + style_id | Per-phoneme duration (sec) |
| 2 | yukarin_sa | Vowel/consonant ID + accent position + style_id | Per-mora f0 (pitch, Hz) |
| 3 | decode | Phoneme onehot (45 types, per-frame) + f0 | 24kHz audio waveform |
Engine → Core: ctypes FFI Details
Python Engine calls Rust Core's C ABI functions directly via ctypes. ONNX Runtime is encapsulated inside the Core DLL — Python never touches sessions.
Key Patterns
- ✓ Zero-copy — C functions read/write numpy array memory directly via array.ctypes.data_as(POINTER(c_long))
- ✓ threading.Lock() — serializes all Core calls; ONNX Runtime sessions aren't thread-safe
- ✓ Lazy loading — models are loaded per style_id on first use, reducing startup time
- ✓ Error propagation — a C function returning false triggers last_error_message(), converting Rust errors to Python exceptions
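The error-propagation path can be sketched like this. FakeCore is a stand-in for the real ctypes CDLL handle; only the returns-false-then-fetch-message pattern follows the description above:

```python
# Core call returns c_bool False -> fetch last_error_message() -> raise.
class FakeCore:
    """Stand-in for the ctypes handle to the Core shared library."""
    def yukarin_s_forward(self, *args):
        return False                        # simulate inference failure
    def last_error_message(self):
        return b"invalid phoneme id"

def checked_call(core, name, *args):
    """Call a Core function; convert a false return into a Python exception."""
    if not getattr(core, name)(*args):
        raise RuntimeError(core.last_error_message().decode())

try:
    checked_call(FakeCore(), "yukarin_s_forward")
except RuntimeError as e:
    print(e)
```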
Inside Core (Rust → ONNX Runtime)
Rust C ABI functions receive C pointers → convert via unsafe { slice::from_raw_parts } to ndarray → run ONNX session → as_standard_layout() for C-contiguous guarantee → copy back to output pointer. GPU selection registers CUDAExecutionProvider or DirectMLExecutionProvider on the session builder. Models stored as .onnx inside .vvm ZIP archives.
Why Cascade Instead of End-to-End?
End-to-end models like VITS go from text straight to audio. Convenient, but you can't touch intermediate results. "I want to raise the pitch on this word" is impossible.
VOICEVOX returns stages 1-2 results as AudioQuery (JSON). Users draw pitch curves in the editor UI, adjust phoneme durations, then run only stage 3 (vocoder). This "editable intermediate representation" is VOICEVOX's core design philosophy.
Developer Background
Hiho (Hiroshiba) — ML Engineer at Dwango Media Village. Creator of the "yukarin" AI voice conversion library. First release: August 2021. Goal: connect people who want to create free character voices with people who want to use them.
Step-by-Step
1. Input Japanese text → OpenJTalk performs morphological analysis and generates AccentPhrases
2. yukarin_s model predicts per-phoneme duration (length)
3. yukarin_sa model predicts per-mora f0 (pitch)
4. AudioQuery (JSON) is returned to the user — pitch, speed, and intonation are editable
5. decode model (vocoder) generates a 24kHz WAV waveform from the edited parameters
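Concretely, the AudioQuery handed back at step 4 is plain JSON that any script can edit before calling /synthesis. The sketch below is trimmed down — the real schema has more fields (volumeScale, the contents of accent_phrases, etc.):

```python
# A trimmed-down AudioQuery: edit it, then POST to /synthesis to run
# only stage 3 (the vocoder) on the adjusted parameters.
query = {
    "accent_phrases": [],        # per-mora pitch/length values live in here
    "speedScale": 1.0,
    "pitchScale": 0.0,
    "intonationScale": 1.0,
    "outputSamplingRate": 24000,
}
query["pitchScale"] = 0.05       # raise overall pitch slightly
query["speedScale"] = 1.2        # speak 20% faster
```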
Pros
- ✓ Completely free + commercial use allowed (credit required)
- ✓ 25+ characters × 8 emotion styles — wide voice selection
- ✓ Manual fine-tuning of pitch/speed/intonation via AudioQuery
- ✓ ~330x faster with GPU vs CPU (ONNX Runtime)
Cons
- ✗ Japanese only — no English/Korean speech synthesis
- ✗ Slow synthesis on CPU without GPU (CUDA)
- ✗ Multi-GPU environment (Optimus, etc.) issues reported with DirectML version
- ✗ Usage terms vary per character — must check in advance