
What Is Embedding?

You need numbers to compute β€” the starting point of all AI

Computers don't know that "cat" and "dog" are similar. They're different strings. Embedding solves this β€” similar meanings become nearby numbers, different meanings become distant numbers.

"cat" β†’ [0.82, -0.15, 0.41, ...]
"dog" β†’ [0.79, -0.12, 0.38, ...]
"car" β†’ [-0.33, 0.67, -0.21, ...]

Cat and dog vectors are close, car is far. That's all embedding is.
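We can check this claim directly with the (truncated, 3-dimensional) example vectors above. Real embedding vectors have hundreds or thousands of dimensions, but the math is the same:

```python
import math

# The truncated example vectors from above (toy values).
cat = [0.82, -0.15, 0.41]
dog = [0.79, -0.12, 0.38]
car = [-0.33, 0.67, -0.21]

def cosine(u, v):
    """Cosine similarity: 1.0 = same direction (similar meaning),
    near zero or negative = unrelated or opposed."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

print(cosine(cat, dog))  # close to 1: similar
print(cosine(cat, car))  # negative: far apart
```

Cosine similarity ignores vector length and compares only direction, which is why it is the default choice for comparing embeddings.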

Embedding is the goal, methods vary

"I want to convert words into vectors" β€” that's the goal called embedding. The method has evolved over time.

Statistics-based (pre-embedding era):

  • TF-IDF: Vectors from word frequency. No semantics

  • LSA: Dimensionality reduction via matrix decomposition. Slight semantic capture
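A toy TF-IDF sketch (raw counts, no smoothing) makes the "no semantics" point concrete: every distinct word is its own dimension, so "cat" and "dog" share nothing:

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} vector per whitespace-tokenized doc.
    Weight = term frequency * log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for doc in tokenized for term in set(doc))
    return [
        {t: tf[t] * math.log(n / df[t]) for t in tf}
        for tf in (Counter(doc) for doc in tokenized)
    ]

docs = ["the cat sat", "the dog sat", "a fast car"]
vecs = tfidf(docs)
# "cat" and "dog" live on different dimensions, so documents 1 and 2
# look similar only through shared words like "the" and "sat".
```

Because similarity can only come from literal word overlap, two documents about the same topic in different words score zero, which is the gap the neural methods below close.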

Neural network-based (2013 onward):

  • Word2Vec: 2-layer neural net. Made "king - man + woman = queen" possible

  • GloVe: Hybrid of co-occurrence statistics + matrix decomposition

  • FastText: Character-level decomposition, robust to typos and neologisms
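The "king - man + woman = queen" arithmetic can be illustrated with hand-made 2-d vectors (dimensions roughly "royalty" and "gender"; the values are invented for illustration, not learned by Word2Vec):

```python
# Toy hand-crafted vectors; a real Word2Vec model learns these from text.
words = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "car":   [-0.8, 0.3],
    "apple": [-0.5, -0.6],
}

# king - man + woman, component by component:
target = [k - m + w for k, m, w in zip(words["king"], words["man"], words["woman"])]

def dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Nearest word to the result, excluding the query words themselves:
answer = min(
    (w for w in words if w not in {"king", "man", "woman"}),
    key=lambda w: dist(words[w], target),
)
print(answer)
```

With learned vectors the same query is what libraries like gensim expose as an analogy search; the point is that consistent directions in the space (here, the "gender" axis) encode relationships.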

Transformer-based (2018 onward):

  • BERT: Context-aware embeddings. Same "bank" gets different vectors for finance vs. riverbank

  • GPT family: Large-scale pretrained models

  • OpenAI text-embedding-3-small: 1536 dimensions, multilingual, ready via API
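A minimal sketch of what calling a pretrained embedding API looks like, using only the standard library (the official `openai` Python client wraps this same endpoint). It assumes an `OPENAI_API_KEY` environment variable; nothing here sends a request until `embed` is called:

```python
import json
import os
import urllib.request

MODEL = "text-embedding-3-small"  # returns 1536-dimensional vectors

def embed(texts, api_key=None):
    """POST texts to the OpenAI embeddings endpoint and return
    one vector (list of floats) per input text."""
    api_key = api_key or os.environ["OPENAI_API_KEY"]
    req = urllib.request.Request(
        "https://api.openai.com/v1/embeddings",
        data=json.dumps({"model": MODEL, "input": texts}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    return [item["embedding"] for item in payload["data"]]

# vectors = embed(["cat", "dog", "car"])  # each has 1536 dimensions
```

This is the "pretrained" path discussed below: no training data of your own, just text in and vectors out.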

Why embeddings matter for RecSys

The core question in recommendation is "will this user like this item?"

Put users and items in the same vector space, and distance becomes preference. Close means likely to enjoy, far means unlikely.

This idea is shared by most RecSys approaches from Matrix Factorization to Two-Tower models.
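The distance-as-preference idea can be sketched with toy 3-d vectors (the values and item names are invented for illustration): score every item for a user, then rank.

```python
# Hypothetical user and item vectors living in the same toy space.
user = [0.9, 0.1, -0.3]

items = {
    "sci-fi movie": [0.8, 0.2, -0.1],
    "cooking show": [-0.5, 0.9, 0.0],
    "space documentary": [0.7, 0.0, -0.4],
}

def score(u, v):
    """Dot product: higher = closer in space = stronger predicted preference."""
    return sum(a * b for a, b in zip(u, v))

ranked = sorted(items, key=lambda name: score(user, items[name]), reverse=True)
print(ranked)  # sci-fi and space content outrank the cooking show
```

Matrix Factorization learns these vectors from the rating matrix, and Two-Tower models learn them with two neural encoders, but both serve recommendations with exactly this kind of score-and-rank step.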

Training your own vs. pretrained models

Training your own creates vectors from your data. Domain-optimized but needs tens of thousands of examples.

Pretrained models (OpenAI, etc.) take your text and return vectors from an already-trained model. Works with small datasets and supports multilingual out of the box. Most projects should start here.

How It Works

1. Input unstructured data (text, images, etc.)

2. Embedding model converts it to a fixed-size numeric vector

3. Measure similarity via vector distance (cosine similarity, etc.)

4. Close vectors = semantically similar things
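The four steps above, sketched end to end. The "model" here is a toy character-bigram counter, so it captures spelling overlap rather than meaning; a real embedding model learns vectors where meaning, not surface form, drives closeness:

```python
import math
from collections import Counter

def embed(text):
    # Step 2 (toy stand-in for a real model): text -> numeric vector,
    # here a sparse vector of character-bigram counts.
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(u, v):
    # Step 3: similarity as the normalized dot product of two vectors.
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

# Steps 1 and 4: feed in raw strings, then read distance as similarity.
print(cosine(embed("cat"), embed("cats")))  # high bigram overlap
print(cosine(embed("cat"), embed("dog")))   # no overlap
```

Swapping the toy `embed` for a trained model is the only change needed to turn this sketch into a real semantic-similarity pipeline.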

Pros

  • Makes unstructured data mathematically comparable
  • Multilingual and multimodal unification happens naturally in vector space

Cons

  • Vectors alone cannot explain "why" things are similar (not interpretable)
  • Embedding quality heavily depends on training data β€” biased data produces biased vectors

Use Cases

  • Search β€” semantic matching between query and documents
  • Recommendation β€” predict preference via user-item vector distance
  • Classification β€” sentiment analysis, spam detection from vectors
  • Clustering β€” grouping similar vectors together