🎭

Synctoon — AI Pipeline That Turns a Script into 2D Animated Video

Gemini directs emotion/action/background, Gentle handles lip-sync, Pillow composites frames

Feed it a narration audio file and a script text file, and out comes a 2D talking-head animation video. The character moves its mouth in sync with speech, changes facial expressions based on emotion, and gets scene-appropriate backgrounds.

Synctoon automates all of this.

3-Stage Pipeline

The overall flow is simple. Analyze → Composite → Compile.

Stage 1 — AI Analysis + Audio Alignment (core.py)

Gentle (a Docker-based forced aligner) takes the audio and text, returning per-word start/end timestamps. "Hello" occupies the 0.5s–0.8s window, for instance.
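A successful alignment response has roughly this shape (field names follow Gentle's API; the values here are illustrative, not taken from the project):

```python
# Illustrative Gentle alignment result; the pipeline only needs the
# per-word "start"/"end" seconds from the "words" list.
alignment = {
    "transcript": "Hello world",
    "words": [
        {"word": "Hello", "case": "success", "start": 0.5, "end": 0.8},
        {"word": "world", "case": "success", "start": 0.9, "end": 1.3},
    ],
}

# e.g. the time window occupied by each word:
windows = {w["word"]: (w["start"], w["end"]) for w in alignment["words"]}
```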

Then the text goes to Gemini 2.0 Flash eight times, each call analyzing a different directing element:

  • Head direction (L/R/center)
  • Eye direction (L/R/center, with a 90% bias toward center)
  • Character assignment (who's speaking)
  • Emotion (14 types — happy, sad, angry, shock, evil_laugh, etc.)
  • Body pose (47 types — dancing, kung_fu, meditation, etc.)
  • Intensity (normal/high)
  • Zoom level (0/1/2)
  • Background (31 scenes — office, forest, bedroom, park, etc.)
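Merged together, the eight analyses might yield a per-word record like the following (a hypothetical example: the field names echo the CSV columns described below, but the exact JSON shape and values are invented for illustration):

```python
# Hypothetical per-word directing record after all eight Gemini analyses
# are merged; key names mirror the pipeline's CSV columns, values invented.
directing_row = {
    "word": "Hello",
    "head_direction": "M",   # "M" appears as the center/default in core.py
    "eyes_direction": "M",
    "character": 1,
    "emotion": "happy",      # one of the 14 emotion types
    "body": "dancing",       # one of the 47 body poses
    "intensity": "normal",
    "zoom": 0,
    "background": "office",  # one of the 31 background scenes
}
```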

AI responses are validated via marshmallow schemas. On validation failure, the error message gets appended to the prompt for a retry — up to 3 times. A self-correcting loop.

Next, g2p_en converts each English word to phonemes. "Hello" → HH, AH, L, OW. The word's time window gets evenly distributed across phonemes to determine per-frame mouth shapes.
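A minimal sketch of that even split (the project's distribute_frames works on a richer per-word dict, so the signature here is simplified, and leftover frames are assumed to go to the earliest phonemes):

```python
def distribute_frames(phonemes, start, end, fps=24):
    """Split a word's [start, end) window evenly across its phonemes,
    returning (phoneme, frame_count) pairs that sum to the window's frames."""
    total = round((end - start) * fps)
    base, extra = divmod(total, len(phonemes))
    # the first `extra` phonemes each absorb one leftover frame
    return [(p, base + (1 if i < extra else 0)) for i, p in enumerate(phonemes)]
```

For "Hello" at 0.5s–0.8s (7 frames at 24fps), the four phonemes get 2, 2, 2, and 1 frames.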

Finally, a per-frame CSV is generated at 24fps. Each frame contains character, emotion, body pose, head direction, eye direction, background, mouth shape, zoom, and blink data. A 3-frame blink sequence is auto-inserted every 80 frames.
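The periodic blink can be sketched like this (assuming a simple closed/half/open three-frame pattern; the project's exact blink frames aren't shown in the source):

```python
def blink_value(frame_index, period=80, pattern=("closed", "half", "open")):
    """Return the blink state for a frame: a 3-frame blink sequence
    starts at every `period`-th frame, otherwise eyes stay 'open'."""
    offset = frame_index % period
    if offset < len(pattern):
        return pattern[offset]
    return "open"
```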

Stage 2 — Frame Compositing (frame_generator.py)

Reads the CSV line by line, compositing 5 layers: background → body → head → eyes → mouth. Pillow (PIL) stacks PNG images layer by layer.

metadata.json defines each layer's position and size in pixels: the eyes go at one (x, y) offset on the head, the mouth at another.
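The layer stacking can be sketched as follows (a minimal version; Synctoon's CharacterManager applies the metadata offsets per layer, so this is only the general idea):

```python
from PIL import Image

def composite(background, layers):
    """Paste RGBA layers onto the background bottom-to-top,
    each at its (x, y) pixel offset from metadata."""
    canvas = background.convert("RGBA").copy()
    for image, (x, y) in layers:
        canvas.alpha_composite(image.convert("RGBA"), dest=(x, y))
    return canvas
```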

The smart part is duplicate frame caching. If the character+emotion+pose+head+eyes+mouth+background+zoom combination is identical, the previously composited PNG gets reused. During stretches where even the mouth shape stays the same, one composited frame can cover dozens.

Stage 3 — Video Compilation (frame_to_video.py)

OpenCV's VideoWriter bundles the PNG sequence into a 24fps MP4. FFmpeg merges in the original audio for the final video.
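The audio merge step might be assembled like this (a sketch only: the flags shown are a common FFmpeg choice for muxing, not confirmed from the project's source):

```python
def build_merge_command(video_path, audio_path, out_path):
    """Assemble an FFmpeg command that muxes the silent OpenCV-encoded
    MP4 with the narration audio (flag choices are assumptions)."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", audio_path,
        "-c:v", "copy",   # keep the video stream as-is
        "-c:a", "aac",    # encode audio for MP4 compatibility
        "-shortest",      # stop at the shorter of the two streams
        out_path,
    ]
```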

Phoneme → Mouth Shape Mapping

mouth_image.json maps each phoneme to a specific mouth image. Phoneme "AH" → m_a_e_ah_h for a happy expression, m_a_e_ah_s for a sad one. The same pronunciation uses different mouth assets depending on emotional state.

45 phonemes are grouped into 17 mouth shapes (visemes). One mouth shape covers several similar phonemes.
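The grouping can be sketched as a phoneme-to-viseme lookup (the viseme names and groupings below are illustrative, not the project's actual mouth_image.json; note that g2p_en emits ARPAbet phonemes with stress digits like AH0, so those are stripped before lookup):

```python
# Hypothetical fragment of a phoneme -> viseme table.
VISEMES = {
    "AA": "a", "AE": "a", "AH": "a",
    "B": "mbp", "M": "mbp", "P": "mbp",
    "F": "fv", "V": "fv",
    "OW": "o", "AO": "o",
}

def viseme_for(phoneme, default="rest"):
    """Map an ARPAbet phoneme (stress digit allowed) to a viseme group."""
    return VISEMES.get(phoneme.rstrip("0123456789"), default)
```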

Asset System

Character assets are hierarchical:

images/characters/character_1/ contains body/, head/, eyes/, mouth/, background/ folders.

47 body poses. 14 emotions. 31 backgrounds. A decent combinatorial space for 2D puppet animation. Each folder holds multiple variant images, so even the same pose can render with a different image each time, selected at random.
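Back of the envelope, that space works out as follows (the random variant pick is a guess at the mechanism; the project may select differently):

```python
import random

# Size of the pose/emotion/background space, before head, eyes, mouth, zoom:
poses, emotions, backgrounds = 47, 14, 31
combos = poses * emotions * backgrounds  # 20,398 base combinations

# Within one pose, a variant image is picked at random each time:
variants = ["dancing_1.png", "dancing_2.png", "dancing_3.png"]  # hypothetical names
choice = random.choice(variants)
```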

Current Limitations

The prototype nature shows throughout the code. File paths are hardcoded (/home/oye/Downloads/...), API keys are exposed in source. Only one character is currently active (though the multi-character structure is in place).

The Gentle Docker container must be running, and there's a 6-second sleep between Gemini API calls (rate limit avoidance). Eight calls means the AI analysis stage alone takes 48+ seconds.

No web UI. Everything runs via CLI.

Still, the approach is interesting. Using an LLM as an "animation director" — reading the text and deciding emotion, pose, camera, and background — is work humans used to do, and Synctoon automates it with 8 prompts.

Pipeline Code Explorer

Each card pairs a pipeline stage with an excerpt of its actual source code.

Stage 1-A: Gentle — Forced Audio-Text Alignment (speach_aligner.py)

POSTs audio + text to Docker container (port 49153), returns per-word start/end timestamps.

```python
import requests

class TranscriptionService:
    def __init__(self, files, url="http://localhost:49153/transcriptions?async=false"):
        self.url = url
        self.files = files

    def send_request(self):
        opened_files = []
        for name, path, content_type in self.files:
            opened_files.append((name, (name, open(path, "rb"), content_type)))
        response = requests.post(self.url, files=opened_files)
        return response.json()
```
Stage 1-B: 8 Gemini API Calls — AI Directing (core.py)

Sends text to Gemini at 6s intervals for head, eyes, character, emotion, pose, intensity, zoom, background.

```python
head_movement = analyzer.get_head_movement_instructions(transcript)
time.sleep(6)
emotions = analyzer.get_emotion(transcript, emotions)
time.sleep(6)
body_action = analyzer.get_body_action(transcript, body_actions)
time.sleep(6)
# ... 8 calls in total
update_values(response_json, head_movement, "head_direction", "M")
update_values(response_json, emotions, "emotion", 1)
```
Validation: marshmallow Self-Correction Loop (text_aligner.py)

Validates AI JSON with marshmallow schema. On failure, appends error to prompt and retries (max 3).

```python
def _send_message_and_extract(self, prompt, schema):
    max_attempts = 3
    attempts = 0
    while attempts < max_attempts:
        response = self.chat.send_message(prompt)
        data = self.extract_json_content(response.text)
        status, message = validate_data(data, schema)
        if status:
            break
        prompt = prompt + "\n" + str(message)  # append the error to the prompt!
        attempts += 1
```
Phonemes: g2p_en — Word → Phoneme + Frame Distribution (add_phonemes.py)

"Hello" → HH, AH, L, OW. The word's time window is split across phonemes for per-frame mouth shapes.

```python
import math

from g2p_en import G2p

g2p = G2p()

def add_phonemes(data, FRAME_PER_SECOUND=24):
    for each_data in data.get("words"):
        each_data["phonemes"] = g2p(each_data["word"])
        each_data["init_frame"] = math.ceil(float(each_data["start"]) * FRAME_PER_SECOUND)
        each_data["phonemes_frame"] = distribute_frames(each_data)
```
Stage 2: 5-Layer Compositing — BG → Body → Head → Eyes → Mouth (CharacterManager.py)

Pixel coordinates from metadata.json place eyes+mouth on head, head on body, character on background.

```python
def get_character(self, Character, Emotion, Body, ...):
    head, _ = self.get_asset(Character, "head", Head_Direction)
    eyes, eyes_meta = self.get_asset(Character, "eyes", Emotion, ...)
    mouth, mouth_meta = self.get_asset(Character, "mouth", ...)
    head = self.adding_eyes_and_mouth(head, eyes, mouth, eyes_meta, mouth_meta)
    body, body_meta = self.get_asset(Character, "body", Body)
    character = self.adding_head_and_body(body=body, head=head, metadata=body_meta)
    bg, bg_meta = self.get_asset(Character, "background", Background)
    scene = self.adding_background(body=character, background=bg, metadata=bg_meta, zoom=zoom)
```
Cache: Duplicate Frame Caching (frame_generator.py)

String key from all visual params. Identical combo β†’ reuse cached PNG.

```python
key = (character + emotion + body + head_direction + eyes_direction
       + background + mouth_emotion + mouth_name + str(zoom) + str(blink))
if key not in frame_data["key_counter"]:
    image, meta = manager.get_character(...)
    image.save(image_file)
else:
    frame_data["key_counter"][key] += 1  # reuse the cached frame!
```

Asset Layer Compositing Order

🌄 Background → 🧍 Body → 😐 Head → 👀 Eyes → 👄 Mouth

Step-by-Step

1. Prepare the script (.txt) and narration audio (.mp3)
2. Run the Gentle Docker container → forced audio-text alignment (per-word timestamps)
3. Call the Gemini API 8 times → auto-generate directing cues for emotion, pose, background, and camera
4. Convert words → phonemes via g2p_en and generate the per-frame mouth-shape CSV
5. Composite 5 layers (bg → body → head → eyes → mouth) with Pillow, caching duplicate frames
6. Compile PNGs → a 24fps MP4 with OpenCV and merge the audio with FFmpeg

Pros

  • Fully open-source — all code and assets available, freely customizable
  • LLM-based auto-directing — the AI chooses from 14 emotions, 47 poses, and 31 backgrounds
  • Phoneme-level lip-sync — Gentle forced alignment + g2p produces natural mouth movement
  • Frame-deduplication caching drastically reduces compositing time

Cons

  • Early prototype — hardcoded paths, exposed API keys, not production-ready
  • English only — g2p_en supports English phonemes only; no Korean or Japanese
  • 8 Gemini API calls at 6s intervals → AI analysis alone takes 48+ seconds
  • Only 1 character asset active — the multi-character structure exists, but assets are missing

Use Cases

  • YouTube story channels — record narration, get a character animation video automatically
  • Educational content — convert lecture scripts into character-explained videos
  • Prototyping — quickly preview storyboard videos before full animation production