BERT4Rec
Masked Language Modeling for recommendations: learning preferences by filling in the blanks
SASRec uses left-to-right, unidirectional attention. BERT4Rec (Sun et al., 2019) goes bidirectional.
In the sequence [A, B, C, D, E], replace C with [MASK] and predict it from the surrounding context A, B, D, and E, exactly what BERT does with text (the Cloze objective).
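The Cloze-style masking step can be sketched as follows. This is a minimal illustration, not the paper's code; `cloze_mask`, `MASK`, and the mask probability are hypothetical names and values.

```python
import random

MASK = "[MASK]"

def cloze_mask(sequence, mask_prob=0.2, rng=None):
    """Randomly replace items with [MASK] (hypothetical helper).

    Returns the masked sequence plus a map from each masked position to
    its original item, which the model learns to predict using context
    from both the left and the right of the mask.
    """
    rng = rng or random.Random(0)  # fixed seed here for reproducibility
    masked, targets = [], {}
    for i, item in enumerate(sequence):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = item
        else:
            masked.append(item)
    return masked, targets

masked, targets = cloze_mask(["A", "B", "C", "D", "E"], mask_prob=0.4)
```

In practice BERT4Rec also reserves a small chance of keeping or randomly replacing a selected item, as BERT does, but random [MASK] substitution is the core of the objective.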
Unidirectional vs Bidirectional
SASRec: A → B → C → ? (predict from the left side only)
BERT4Rec: A → ? → C → D (use both sides)
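The difference between the two boils down to the attention mask: SASRec uses a causal (lower-triangular) mask, while BERT4Rec lets every position attend to every other. A minimal sketch, with `causal_mask` and `full_mask` as hypothetical helper names (row = query position, column = key position, `True` = "may attend"):

```python
def causal_mask(n):
    """SASRec-style left-to-right mask: position i sees only j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def full_mask(n):
    """BERT4Rec-style bidirectional mask: every position sees every position."""
    return [[True] * n for _ in range(n)]
```

For a length-4 sequence, `causal_mask(4)` forbids position 0 from attending to position 3, while `full_mask(4)` allows it; everything else about the Transformer encoder can stay the same.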
Bidirectional attention seems intuitively stronger, but serving requires next-item prediction, so at inference time you mask the last position and read the prediction there.
Train-serve gap
Training masks random positions, while serving always predicts the final position; this mismatch can hurt performance. The original paper mitigates it by also generating training samples that mask only the last item, and later variants address the discrepancy further.
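The original BERT4Rec paper also mixes in training samples that mask only the final item, so that some training inputs look exactly like the serving input. A minimal sketch of such a sample; `last_item_sample` is a hypothetical helper name:

```python
MASK = "[MASK]"

def last_item_sample(sequence):
    """Training sample that masks only the final item, mirroring the
    serving-time input exactly (the train-serve-gap mitigation)."""
    masked = sequence[:-1] + [MASK]
    targets = {len(sequence) - 1: sequence[-1]}
    return masked, targets
```

Mixing a fraction of these samples into the randomly masked ones narrows the gap without giving up the bidirectional Cloze objective.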
How It Works
1. Replace random positions in the user's behavior sequence with [MASK]
2. Predict the masked items with a bidirectional Transformer
3. Each position's representation thus reflects context from both directions
4. At serving time, mask the last position to predict the next item
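The serving step above can be sketched as follows. This is a hedged illustration: `predict_next` and the `model` interface (a callable returning one score dict per position) are hypothetical, and `max_len` stands in for the model's trained sequence window.

```python
MASK = "[MASK]"

def predict_next(model, history, max_len=8):
    """Serve next-item prediction with a Cloze-trained model: append [MASK],
    truncate to the model's window, and take the top item at that position."""
    seq = (history + [MASK])[-max_len:]   # keep the most recent max_len slots
    scores = model(seq)                   # per-position {item: score} dicts
    return max(scores[-1], key=scores[-1].get)  # top-1 item at the mask
```

Note that the mask always lands in the final slot, which is precisely the position the random-masking objective trains least directly, hence the train-serve gap discussed above.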
Pros
- ✓ Bidirectional context → richer representations than unidirectional
- ✓ Can reuse BERT ecosystem tools and techniques
Cons
- ✗ Objective mismatch between training and serving (train-serve gap)
- ✗ Not always superior to SASRec (depends on data)