
Transformer-based Recommendation

Self-Attention processes the entire behavior sequence at once

SASRec (2018) is the landmark model: the first to apply Transformer Self-Attention to recommendations.

Same goal as GRU4Rec: predict the next item. The difference is in methodology.

GRU vs Transformer

GRU processes sequences front-to-back. For the 10th item's representation to carry information from the 1st item, the hidden state must propagate through 9 steps, and information dilutes along the way.

Transformer lets every position attend to every other position directly. The 10th item can reference the 1st directly. Stronger on long-range dependencies, and far better for GPU parallelization.
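The "every position attends to every other position" idea can be made concrete with a minimal sketch. This is causal scaled dot-product self-attention in plain numpy; for brevity it uses the embeddings themselves as queries, keys, and values (a real model would apply learned projection matrices):

```python
import numpy as np

def causal_self_attention(x):
    """Scaled dot-product self-attention with a causal mask.

    x: (seq_len, d) matrix of item embeddings. Each position may only
    attend to itself and earlier positions (no peeking at future items).
    """
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)                       # (seq_len, seq_len)
    mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
    scores[mask] = -np.inf                              # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))            # 10 items in the behavior sequence
out, w = causal_self_attention(x)
print(w[9, 0] > 0)                      # prints True: position 10 weights item 1 directly
```

The weight `w[9, 0]` is the direct connection the text describes: the 10th position reaches the 1st item in one hop, instead of relaying information through 9 recurrent steps.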

Real-world impact

Alibaba, JD.com and other large e-commerce platforms reported significant CTR improvements after replacing GRU4Rec with Transformer-based models. However, larger models mean serving latency becomes an issue.

How It Works

1. Add Position Encoding to the item sequence
2. Learn item-item relationships via Multi-Head Self-Attention
3. Refine representations with Feed-Forward + Layer Norm
4. Predict the next item from the last position's output
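The four steps can be sketched end to end in numpy. This is an illustrative single-head, single-block forward pass with random weights standing in for learned parameters (the model names and shapes here are assumptions for the sketch, not SASRec's exact configuration, which also uses dropout and multiple blocks):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d, seq_len = 50, 16, 10

# Learned in practice; random here for illustration.
item_emb = rng.normal(0, 0.1, (n_items, d))    # item embedding table
pos_emb = rng.normal(0, 0.1, (seq_len, d))     # learnable position encoding
Wq, Wk, Wv = (rng.normal(0, 0.1, (d, d)) for _ in range(3))
W1, W2 = rng.normal(0, 0.1, (d, 4 * d)), rng.normal(0, 0.1, (4 * d, d))

def layer_norm(x, eps=1e-6):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def forward(seq):
    # Step 1: item embedding + position encoding
    h = item_emb[seq] + pos_emb[:len(seq)]
    # Step 2: causal self-attention (single head for brevity)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(d)
    scores += np.triu(np.full((len(seq), len(seq)), -np.inf), k=1)
    h = layer_norm(h + softmax(scores) @ v)      # residual + layer norm
    # Step 3: position-wise feed-forward + layer norm
    h = layer_norm(h + np.maximum(h @ W1, 0) @ W2)
    # Step 4: score every item against the last position's output
    return h[-1] @ item_emb.T                    # (n_items,) logits

seq = rng.integers(0, n_items, seq_len)          # a user's behavior sequence
logits = forward(seq)
next_item = int(np.argmax(logits))               # predicted next item id
```

In training, the logits at every position are scored against the actual next item; at serving time only the last position's logits are needed, as in step 4.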

Pros

  • Directly captures long-range dependencies (advantage over GRU)
  • Fast training via GPU parallelization

Cons

  • O(n²) attention complexity: cost grows quadratically with sequence length
  • Serving latency harder to manage than GRU
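The O(n²) cost comes from the n×n attention-score matrix computed at every layer. A back-of-the-envelope sketch (the per-pair byte count is an assumption: one float32 score, single head, one layer):

```python
# One score per (query, key) pair: the matrix holds n * n entries.
def score_matrix_kib(n, bytes_per_score=4):   # assume float32 scores
    return n * n * bytes_per_score / 1024

for n in (50, 200, 1000):
    print(n, score_matrix_kib(n), "KiB")
```

Doubling the sequence length quadruples the score matrix, which is why long behavior histories are typically truncated to a fixed window at serving time.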

Use Cases

  • Real-time recommendation in large-scale e-commerce
  • "Up next" recommendation on video platforms