How Does ChatGPT Streaming Work?
Real-time Token-by-Token Response Delivery via SSE
When you set stream: true on OpenAI's Chat Completions API, the response is not returned all at once; instead it is transmitted as Server-Sent Events (SSE), with a chunk emitted as each token is generated. Because the request is a POST, clients typically read the stream with fetch + ReadableStream (the browser's EventSource only supports GET). Each chunk arrives as a line like data: {"choices":[{"delta":{"content":"Hi"}}]}, and the stream ends when data: [DONE] is received. Users see output in real time while the LLM is still generating.
Architecture Diagram
data: {"delta":{"content":"Hi"}}
data: {"delta":{"content":" the"}}
data: {"delta":{"content":"re"}}
data: [DONE]
- Single request (POST), only the response is streamed → unidirectional SSE is sufficient
- HTTP-based, good CDN/proxy compatibility
- On disconnection, retry with new request (stateless)
- Simpler server implementation compared to WebSocket
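The stateless-retry point above can be sketched as follows. stream_with_retry and flaky_stream are hypothetical names; in practice stream_fn would open a fresh POST request on every call:

```python
def stream_with_retry(stream_fn, max_retries: int = 2) -> str:
    """Consume a token stream; on mid-stream failure, start a fresh request.

    Because the streaming request is stateless, a dropped connection is
    handled by issuing a new request and discarding the partial result.
    """
    for attempt in range(max_retries + 1):
        collected = []
        try:
            for token in stream_fn():  # each call opens a new request
                collected.append(token)
            return "".join(collected)
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
    return ""

# Simulated flaky stream: fails once mid-stream, then succeeds.
calls = {"n": 0}
def flaky_stream():
    calls["n"] += 1
    yield "Hi"
    if calls["n"] == 1:
        raise ConnectionError("connection dropped")
    yield " there"
```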
How It Works
1. Client sends a request to POST /v1/chat/completions with stream: true
2. Server starts the response with Content-Type: text/event-stream
3. LLM generates a token → data: {"delta":{"content":"Hi"}} is sent immediately
4. Client receives each chunk and appends it to the UI
5. When all tokens are generated → data: [DONE] is sent
6. Client handles stream termination
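The steps above can be simulated end to end without a network. Here the server side is faked by a generator emitting SSE-formatted lines (fake_server and client are illustrative names, not part of any API):

```python
import json

def fake_server(tokens):
    """Steps 2-5: emit each token as an SSE data line, then the terminator."""
    for t in tokens:
        yield f'data: {json.dumps({"choices": [{"delta": {"content": t}}]})}'
    yield "data: [DONE]"

def client(stream) -> str:
    """Steps 4 and 6: append each delta to the 'UI', stop on [DONE]."""
    ui = []
    for line in stream:
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # step 6: handle stream termination
        delta = json.loads(payload)["choices"][0]["delta"]
        ui.append(delta.get("content", ""))
    return "".join(ui)

# client(fake_server(["Hi", " there"])) -> "Hi there"
```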
Pros
- ✓ Dramatically improved perceived response speed (minimized time to first token, TTFT)
- ✓ No need to wait for full LLM generation
- ✓ Simple implementation as it is HTTP-based
- ✓ Can cancel mid-stream (AbortController)
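Mid-stream cancellation (the AbortController point above) has a simple Python analogue: stop consuming and close the stream, which lets the HTTP client tear down the connection. consume_until and token_stream are hypothetical names for this sketch:

```python
def consume_until(stream, stop_after: int) -> str:
    """Consume at most stop_after tokens, then cancel the stream."""
    out = []
    for token in stream:
        out.append(token)
        if len(out) >= stop_after:
            break  # user pressed "stop generating"
    # Explicitly close generator-backed streams so the underlying
    # connection is released (analogous to AbortController.abort()).
    if hasattr(stream, "close"):
        stream.close()
    return "".join(out)

def token_stream():
    # Stand-in for a live HTTP-backed token generator.
    for t in ["Hi", " th", "ere"]:
        yield t

partial = consume_until(token_stream(), 2)
# partial == "Hi th"; the remaining token "ere" is never produced
```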
Cons
- ✗ Token-by-token processing logic required
- ✗ Complex error handling (mid-stream disconnection)
- ✗ Total token count unknown in advance
- ✗ Client buffering management needed
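On the buffering point: network reads do not align with SSE event boundaries, so a single data: line can arrive split across reads and must be buffered until complete. A sketch, assuming newline-delimited data: lines (SSEBuffer is an illustrative name):

```python
class SSEBuffer:
    """Accumulate raw network chunks and yield only complete SSE lines."""

    def __init__(self):
        self._buf = ""

    def feed(self, chunk: str):
        """Add a chunk; return every complete line received so far."""
        self._buf += chunk
        # Everything before the last newline is complete; keep the tail.
        *lines, self._buf = self._buf.split("\n")
        return [l for l in lines if l]  # drop blank event separators

buf = SSEBuffer()
# A single SSE event may arrive split across two network reads:
first = buf.feed('data: {"delta":{"con')
second = buf.feed('tent":"Hi"}}\n')
# first == []  (no complete line yet)
# second == ['data: {"delta":{"content":"Hi"}}']
```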