Synctoon: AI Pipeline That Turns a Script into 2D Animated Video
Gemini directs emotion/action/background, Gentle handles lip-sync, Pillow composites frames
Feed it a narration audio file and a script text file, and out comes a 2D talking-head animation video. The character moves its mouth in sync with speech, changes facial expressions based on emotion, and gets scene-appropriate backgrounds.
Synctoon automates all of this.
3-Stage Pipeline
The overall flow is simple. Analyze → Composite → Compile.
Stage 1: AI Analysis + Audio Alignment (core.py)
Gentle (a Docker-based forced aligner) takes the audio and text, returning per-word start/end timestamps. "Hello" occupies the 0.5s–0.8s window, for instance.
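A minimal sketch of that alignment call, assuming the container's HTTP port is mapped to 49153 (the stock lowerquality/gentle image serves on 8765 internally) and placeholder file names:

```python
# Minimal sketch of a Gentle alignment request. Port 49153 is the mapping
# this project uses; file names are placeholders.
# Start the aligner first, e.g.:  docker run -p 49153:8765 lowerquality/gentle
import requests

GENTLE_URL = "http://localhost:49153/transcriptions?async=false"

with open("narration.mp3", "rb") as audio, open("script.txt", "rb") as transcript:
    resp = requests.post(GENTLE_URL,
                         files={"audio": audio, "transcript": transcript})
resp.raise_for_status()

# Gentle returns one entry per word, with start/end times in seconds.
for w in resp.json()["words"]:
    if w.get("case") == "success":
        print(w["word"], w["start"], w["end"])
```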
Then the text goes to Gemini 2.0 Flash eight times, each call analyzing a different directing element:
- Head direction (L/R/center)
- Eye direction (L/R/center, 90% center bias)
- Character assignment (who's speaking)
- Emotion (14 types: happy, sad, angry, shock, evil_laugh, etc.)
- Body pose (47 types: dancing, kung_fu, meditation, etc.)
- Intensity (normal/high)
- Zoom level (0/1/2)
- Background (31 scenes: office, forest, bedroom, park, etc.)
AI responses are validated via marshmallow schemas. On validation failure, the error message gets appended to the prompt for a retry, up to 3 times. A self-correcting loop.
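A sketch of that self-correcting loop, assuming the google-generativeai client; the schema fields and prompt wording here are illustrative, not taken from the source:

```python
# Sketch of the validate-and-retry loop: feed the validation error back
# into the prompt so the model can correct its own output.
import json

import google.generativeai as genai
from marshmallow import Schema, fields, ValidationError

class EmotionSchema(Schema):
    word_index = fields.Int(required=True)   # illustrative field names
    emotion = fields.Str(required=True)

model = genai.GenerativeModel("gemini-2.0-flash")

def analyze(prompt: str, max_retries: int = 3) -> list[dict]:
    for _ in range(max_retries):
        raw = model.generate_content(prompt).text
        try:
            return EmotionSchema(many=True).load(json.loads(raw))
        except (ValidationError, json.JSONDecodeError) as err:
            # Append the error so the next attempt can fix the output.
            prompt += f"\nYour previous answer was invalid: {err}. Return valid JSON."
    raise RuntimeError("Gemini output never passed schema validation")
```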
Next, g2p_en converts each English word to phonemes. "Hello" → HH, AH, L, OW. The word's time window gets evenly distributed across phonemes to determine per-frame mouth shapes.
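In code, that step looks roughly like this (the 0.5s–0.8s window is the "Hello" example from above):

```python
# Sketch: word -> phonemes via g2p_en, then split the Gentle-aligned
# time window evenly across them.
from g2p_en import G2p

g2p = G2p()
phones = [p for p in g2p("Hello") if p.strip()]   # ['HH', 'AH0', 'L', 'OW1']

start, end = 0.5, 0.8                 # per-word window from Gentle
step = (end - start) / len(phones)
spans = [(p, start + i * step, start + (i + 1) * step)
         for i, p in enumerate(phones)]
# [('HH', 0.5, 0.575), ('AH0', 0.575, 0.65), ...]
```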
Finally, a per-frame CSV is generated at 24fps. Each frame contains character, emotion, body pose, head direction, eye direction, background, mouth shape, zoom, and blink data. A 3-frame blink sequence is auto-inserted every 80 frames.
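The blink pass is simple to picture; a sketch, assuming each CSV row is a dict whose blink column defaults to 0:

```python
# Sketch of the automatic blink pass: every 80 frames (~3.3s at 24fps),
# a 3-frame blink is written into the rows before the CSV is saved.
BLINK_EVERY = 80
BLINK_FRAMES = 3

def add_blinks(rows: list[dict]) -> None:
    for i in range(BLINK_EVERY, len(rows), BLINK_EVERY):
        for j in range(i, min(i + BLINK_FRAMES, len(rows))):
            rows[j]["blink"] = 1
```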
Stage 2: Frame Compositing (frame_generator.py)
This stage reads the CSV line by line, compositing 5 layers: background → body → head → eyes → mouth. Pillow (PIL) stacks the PNG images layer by layer.
metadata.json defines each layer's position and size in pixels: the eyes go at one (x, y) offset on the head, the mouth at another.
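A sketch of the five-layer paste, assuming metadata.json stores an (x, y) anchor per layer; the paths and key names here are illustrative:

```python
# Composite one frame bottom-to-top: background, body, head, eyes, mouth.
import json
from PIL import Image

with open("metadata.json") as f:
    meta = json.load(f)

def compose(paths: dict) -> Image.Image:
    frame = Image.open(paths["background"]).convert("RGBA")
    for layer in ("body", "head", "eyes", "mouth"):
        part = Image.open(paths[layer]).convert("RGBA")
        # Using the part itself as mask keeps its transparency intact.
        frame.paste(part, (meta[layer]["x"], meta[layer]["y"]), mask=part)
    return frame
```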
The smart part is duplicate-frame caching. If the character + emotion + pose + head + eyes + mouth + background + zoom combination is identical to one already rendered, the previously composited PNG gets reused. During stretches where the character holds the same pose and mouth shape, one composited frame covers dozens.
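A sketch of that cache (held in memory here; the project reuses the saved PNG on disk). resolve_assets() is a hypothetical helper mapping a CSV row to PNG paths:

```python
# Duplicate-frame cache: one string key over every visual parameter;
# identical combinations reuse the already-composited image.
cache = {}

def frame_for(row: dict):
    key = "|".join(str(row[k]) for k in (
        "character", "emotion", "pose", "head",
        "eyes", "mouth", "background", "zoom"))
    if key not in cache:
        cache[key] = compose(resolve_assets(row))  # compose() as sketched above
    return cache[key]
```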
Stage 3: Video Compilation (frame_to_video.py)
OpenCV's VideoWriter bundles the PNG sequence into a 24fps MP4. FFmpeg merges in the original audio for the final video.
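A minimal sketch of that step; the file paths are placeholders:

```python
# Bundle PNGs into a silent 24fps MP4 with OpenCV, then mux the
# narration back in with FFmpeg.
import glob
import subprocess
import cv2

frames = sorted(glob.glob("frames/*.png"))
h, w = cv2.imread(frames[0]).shape[:2]

writer = cv2.VideoWriter("silent.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), 24, (w, h))
for path in frames:
    writer.write(cv2.imread(path))
writer.release()

# Copy the video stream untouched, encode the audio to AAC.
subprocess.run(["ffmpeg", "-y", "-i", "silent.mp4", "-i", "narration.mp3",
                "-c:v", "copy", "-c:a", "aac", "-shortest", "final.mp4"],
               check=True)
```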
Phoneme → Mouth Shape Mapping
mouth_image.json maps each phoneme to a specific mouth image. Phoneme "AH" → m_a_e_ah_h (happy expression) or m_a_e_ah_s (sad expression). The same pronunciation uses different mouth assets depending on emotional state.
45 phonemes are grouped into 17 mouth shapes (visemes). One mouth shape covers several similar phonemes.
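The shape of that lookup, illustratively; only the "AH" entries come from the post, and the exact JSON layout is an assumption:

```python
# Phoneme-to-mouth lookup keyed first by viseme group, then by emotion.
MOUTH_MAP = {
    "AH": {"happy": "m_a_e_ah_h", "sad": "m_a_e_ah_s"},
    # ... 16 more viseme groups covering the remaining phonemes
}

def mouth_asset(phoneme: str, emotion: str) -> str:
    base = phoneme.rstrip("012")      # drop g2p stress digits: 'AH0' -> 'AH'
    return MOUTH_MAP[base][emotion]
```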
Asset System
Character assets are hierarchical:
images/characters/character_1/ contains body/, head/, eyes/, mouth/, background/ folders.
47 body poses. 14 emotions. 31 backgrounds. A decent combinatorial space for 2D puppet animation. Each folder holds multiple variant images, so repeated uses of the same pose draw a different image at random each time (see the sketch below).
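A sketch of that variant selection, assuming one subfolder per pose; the exact directory layout is an assumption:

```python
# Pick one of several interchangeable PNGs for a given pose at random.
import glob
import random

def pick_variant(character: str, pose: str) -> str:
    candidates = glob.glob(f"images/characters/{character}/body/{pose}/*.png")
    return random.choice(candidates)

# pick_variant("character_1", "dancing") -> one of the dancing variants
```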
Current Limitations
The prototype nature shows throughout the code. File paths are hardcoded (/home/oye/Downloads/...), and API keys are exposed in the source. Only one character is currently active (though the multi-character structure is in place).
The Gentle Docker container must be running, and there's a 6-second sleep between Gemini API calls (rate-limit avoidance). Eight calls means the AI analysis stage alone takes 48+ seconds.
No web UI. Everything runs via CLI.
Still, the approach is interesting. Using an LLM as an "animation director": reading text and deciding emotion, pose, camera, and background is work humans used to do, and Synctoon automates it with 8 prompts.
Pipeline Code Explorer
- Gentle alignment: POSTs audio + text to the Docker container (port 49153); returns per-word start/end timestamps.
- Gemini analysis: sends text to Gemini at 6s intervals for head, eyes, character, emotion, pose, intensity, zoom, background.
- Schema validation: validates AI JSON with a marshmallow schema; on failure, appends the error to the prompt and retries (max 3).
- Phoneme conversion: "Hello" → HH, AH, L, OW; the time window is split across phonemes for per-frame mouth shapes.
- Layer compositing: pixel coordinates from metadata.json place eyes + mouth on the head, head on the body, character on the background.
- Frame cache: a string key from all visual params; an identical combo reuses the cached PNG.
Asset Layer Compositing Order: background → body → head → eyes → mouth
Step-by-Step
1. Prepare a script (.txt) and narration audio (.mp3)
2. Run the Gentle Docker container → forced audio-text alignment (per-word timestamps)
3. Call the Gemini API 8 times → auto-generate directing cues for emotion, pose, background, and camera
4. Convert words → phonemes via g2p_en; generate the per-frame mouth-shape CSV
5. Composite 5 layers (background → body → head → eyes → mouth) with Pillow, caching duplicate frames
6. Compile PNGs → a 24fps MP4 with OpenCV; merge audio with FFmpeg
Pros
- ✓ Fully open-source: all code and assets available, freely customizable
- ✓ LLM-based auto-directing: AI decides from 14 emotions, 47 poses, 31 backgrounds
- ✓ Phoneme-level lip-sync: Gentle forced alignment + g2p produces natural mouth movement
- ✓ Frame deduplication caching drastically reduces compositing time
Cons
- ✗ Early prototype: hardcoded paths, exposed API keys, not production-ready
- ✗ English only: g2p_en supports English phonemes only, no Korean/Japanese
- ✗ 8 Gemini API calls + 6s intervals: AI analysis alone takes 48+ seconds
- ✗ Only 1 character asset active: multi-character structure exists but assets are missing