RecSys and Search in the LLM Era — What Actually Changed
How YouTube, Spotify, Netflix, and LinkedIn applied LLMs to recommendations and search
Based on Eugene Yan's analysis, here are four axes along which LLMs are changing recommendations and search.
1. LLM/Multimodal-Enhanced Architectures
Traditional RecSys runs on item IDs: user A viewed items 123 and 456, so recommend 789. The problem: the model has no signal for new items (cold start) or rarely interacted-with items (long tail).
LLMs and multimodal models fix this by understanding item text, images, and audio.
YouTube's Semantic IDs — Content-derived IDs instead of hash-based ones. A transformer encodes each video into an embedding, and an RQ-VAE quantizes that embedding into a short sequence of integers, the Semantic ID. Treating these IDs with n-gram/SentencePiece-style tokenization worked particularly well for cold start.
Kuaishou's M3CSR — Merges visual (ResNet), text (Sentence-BERT), audio (VGGish) embeddings, K-means clusters them into learnable IDs. A/B test: clicks +3.4%, likes +3.0%, follows +3.1%.
Google's CALRec — Fine-tuned PaLM-2 XXS for recommendations via text prompts. Two-stage: multi-category pretraining → category-specific fine-tuning.
Meta's EmbSum — Summarizes user interests and candidate items separately using T5-small and Mixtral-8x22B, then matches them.
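The residual quantization at the heart of Semantic IDs can be sketched in a few lines. This is a toy illustration, not YouTube's implementation: the codebooks here are random, whereas an RQ-VAE learns them jointly with the encoder.

```python
import numpy as np

def residual_quantize(embedding, codebooks):
    """Map a continuous embedding to a sequence of integer codes.

    Each stage picks the nearest codebook vector to the current
    residual, then subtracts it; the code indices together form a
    discrete "Semantic ID". Codebooks here are random stand-ins;
    in an RQ-VAE they are learned with the encoder.
    """
    residual = embedding.copy()
    codes = []
    for codebook in codebooks:                  # one codebook per stage
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))             # nearest code word
        codes.append(idx)
        residual = residual - codebook[idx]     # quantize the leftover
    return codes

rng = np.random.default_rng(0)
embedding = rng.normal(size=8)                  # toy "video embedding"
codebooks = [rng.normal(size=(256, 8)) for _ in range(3)]  # 3 stages, 256 codes each
semantic_id = residual_quantize(embedding, codebooks)
print(semantic_id)                              # three integers, one per stage
```

Because similar content yields similar embeddings, nearby videos end up sharing Semantic ID prefixes, which is what gives new items usable collaborative signal.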
2. LLM-Powered Data Generation
Using LLMs to create or improve the data that feeds recommendation systems, rather than serving recommendations directly.
Bing — GPT-4 generates webpage titles/summaries. Fine-tuned Mistral-7B on 2M pages. Clickbait -31%, duplicate content -76%, authoritative content +18%.
Indeed — Fine-tuned GPT-3.5 to filter bad job matches (eBadMatch). Invitation emails -17.68%, unsubscribes -4.97%, applications +4.13%.
Spotify — Introduced exploratory query recommendations. LLM-generated queries ranked with personalized embeddings. Exploratory queries +9%.
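The common shape of these pipelines is a prompt template plus a parser around an LLM call. The sketch below is illustrative, in the spirit of the Bing example; the template, fields, and `parse_label` helper are assumptions, not any company's actual schema, and the LLM call is replaced with a canned response.

```python
# Metadata-enrichment sketch: build a prompt, parse the model's
# line-oriented answer into structured fields for the index.

PROMPT_TEMPLATE = """You are labeling web pages for a search index.
Page text:
{text}

Return one line each:
Title: <concise, non-clickbait title>
Summary: <two-sentence factual summary>
Quality: <high|medium|low>"""

def build_prompt(page_text: str, max_chars: int = 4000) -> str:
    # Truncate long pages so the prompt fits the model's context window.
    return PROMPT_TEMPLATE.format(text=page_text[:max_chars])

def parse_label(response: str) -> dict:
    # Parse "Key: value" lines into a dict; ignore anything else.
    fields = {}
    for line in response.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip().lower()] = value.strip()
    return fields

# Canned response standing in for a real LLM call:
canned = ("Title: Intro to Residual Quantization\n"
          "Summary: Explains codebooks. Covers Semantic IDs.\n"
          "Quality: high")
labels = parse_label(canned)
print(labels["quality"])   # high
```

The same pattern covers quality filtering (keep only `high`) and query generation (ask for candidate queries instead of labels); a large model produces the labels offline, and a small fine-tuned model replays them at scale.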
3. Scaling Laws, Transfer Learning, Distillation, LoRA
Core LLM techniques now applied to RecSys.
Scaling Laws — Decoder-only Transformers scaled from 98.3K to 0.8B params follow LLM-style scaling laws; bigger models need less data to reach the same performance.
YouTube Knowledge Distillation — Teacher model (2-4x larger) knowledge transferred to student. +0.4% improvement (significant in RecSys).
DLLM2Rec — Distills LLM recommendation knowledge to lightweight models. Inference: 3-6 hours → 1.6-1.8 seconds. Average performance +47.97%.
Alibaba MLoRA — Domain-specific LoRA for CTR prediction. CTR +1.49%, conversion +3.37%.
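The core of the distillation setups above is training the student to match the teacher's full score distribution over items, not just its top pick. A minimal sketch of the temperature-softened KL term (real recipes typically add a hard-label loss and a T² scaling factor):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened item scores.

    A higher temperature exposes the teacher's relative preferences
    among non-top items, which is the "dark knowledge" the small
    student model learns from.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([4.0, 2.0, 1.0, 0.5])    # toy scores over 4 candidate items
perfect = distillation_loss(teacher, teacher)             # identical → 0
worse = distillation_loss(np.array([0.5, 1.0, 2.0, 4.0]), teacher)
print(perfect, worse)                       # 0.0, then a positive value
```

The student keeps its small, servable architecture; only the training signal changes, which is why inference cost can drop by orders of magnitude while retaining most of the teacher's ranking quality.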
4. Unified Search & Recommendation
LinkedIn 360Brew — Single 150B-param model handles 30+ ranking tasks. Prompt engineering instead of feature engineering. Matches or beats specialized models.
Netflix UniCoRn — Unified model for search and recommendation. Recommendations +10%, search +7%.
Etsy Unified Embeddings — Transformer + T5 text + graph embeddings. Graph embeddings contributed most (+15%). Conversion +2.63%.
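The "prompt engineering instead of feature engineering" idea behind unified models can be illustrated with a single template that serves both tasks: search adds a query line, recommendation relies on history alone. The template below is an assumption for illustration, not LinkedIn's or Netflix's actual format.

```python
# One prompt builder for both query-based (search) and
# history-based (recommendation) ranking tasks.

def build_ranking_prompt(history, candidates, query=None):
    lines = ["Instruction: rank the candidate items for this user."]
    if query:                                    # search task only
        lines.append(f"Query: {query}")
    lines.append("Recent activity:")             # shared user signal
    lines += [f"- {item}" for item in history]
    lines.append("Candidates:")
    lines += [f"{i}. {c}" for i, c in enumerate(candidates, 1)]
    return "\n".join(lines)

rec_prompt = build_ranking_prompt(
    history=["ML engineer posts", "RecSys papers"],
    candidates=["LLM distillation talk", "Gardening tips"])
search_prompt = build_ranking_prompt(
    history=["ML engineer posts"],
    candidates=["LLM distillation talk"],
    query="llm recommendations")
print("Query:" in search_prompt, "Query:" in rec_prompt)  # True False
```

Because every task is rendered as text for one model, adding a new ranking surface means writing a new template rather than engineering and backfilling a new feature pipeline.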
What to Take Away
The pattern: rather than using LLMs as recommendation models directly, (1) generate data, (2) distill knowledge to lightweight models, or (3) add multimodal understanding. The most immediately applicable is LLM-powered data generation — metadata enrichment, query generation, and quality filtering work without changing existing pipelines.
How It Works
Solve cold start with LLM/multimodal — Semantic IDs, multimodal embedding + clustering
Generate data with LLMs — metadata enrichment, query generation, quality filtering
Distill LLM knowledge to lightweight models — inference time reduced by orders of magnitude
Domain-specific fine-tuning with LoRA — shared backbone + domain adapters
Unified search and recommendation — single model handles query-based + history-based tasks
Pros
- ✓ Solves cold start/long tail — multimodal understanding enables new item recommendations
- ✓ Data quality improvement — LLMs auto-generate metadata, queries, filters
- ✓ Unified architecture — merging search/recommendations reduces maintenance cost
- ✓ Distillation makes it practical — compress LLM-level performance to servable size
Cons
- ✗ Latency — direct LLM serving is still too heavy for real-time recommendations
- ✗ Unified models don't always beat specialized ones — BM25, SASRec still strong in some areas
- ✗ GPU infrastructure costs — significant compute needed for training and data generation