How Does ChatGPT Streaming Work?
Real-time Token-by-Token Response Delivery via SSE
When you set stream: true on OpenAI's Chat Completions API, the response is not returned all at once; instead it is transmitted as Server-Sent Events (SSE), with a chunk emitted as each token is generated. Because the request is a POST, clients typically read the stream with fetch + ReadableStream (the browser's EventSource only supports GET). Each chunk arrives as a line like data: {"choices":[{"delta":{"content":"Hi"}}]}, and the stream ends when data: [DONE] is received. Users see output in real time while the LLM is still generating.
Architecture Diagram
data: {"delta":{"content":"Hi"}}
data: {"delta":{"content":" the"}}
data: {"delta":{"content":"re"}}
data: [DONE]
- Single request (POST), only the response is streamed → unidirectional SSE is sufficient
- HTTP-based, good CDN/proxy compatibility
- On disconnection, retry with new request (stateless)
- Simpler server implementation compared to WebSocket
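The stateless-retry point above can be sketched as follows. stream_with_retry and flaky_stream are hypothetical names; in practice stream_fn would open a fresh POST request on every call:

```python
def stream_with_retry(stream_fn, max_retries: int = 2) -> str:
    """Consume a token stream; on mid-stream failure, start a fresh request.

    Because the streaming request is stateless, a dropped connection is
    handled by issuing a new request and discarding the partial result.
    """
    for attempt in range(max_retries + 1):
        collected = []
        try:
            for token in stream_fn():  # each call opens a new request
                collected.append(token)
            return "".join(collected)
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
    return ""

# Simulated flaky stream: fails once mid-stream, then succeeds.
calls = {"n": 0}
def flaky_stream():
    calls["n"] += 1
    yield "Hi"
    if calls["n"] == 1:
        raise ConnectionError("connection dropped")
    yield " there"
```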
How It Works
1. Client sends a request to POST /v1/chat/completions with stream: true
2. Server starts the response with Content-Type: text/event-stream
3. LLM generates a token → data: {"delta":{"content":"Hi"}} is sent immediately
4. Client receives each chunk and appends it to the UI
5. When all tokens are generated → data: [DONE] is sent
6. Client handles stream termination
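The steps above can be simulated end to end without a network. Here the server side is faked by a generator emitting SSE-formatted lines (fake_server and client are illustrative names, not part of any API):

```python
import json

def fake_server(tokens):
    """Steps 2-5: emit each token as an SSE data line, then the terminator."""
    for t in tokens:
        yield f'data: {json.dumps({"choices": [{"delta": {"content": t}}]})}'
    yield "data: [DONE]"

def client(stream) -> str:
    """Steps 4 and 6: append each delta to the 'UI', stop on [DONE]."""
    ui = []
    for line in stream:
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # step 6: handle stream termination
        delta = json.loads(payload)["choices"][0]["delta"]
        ui.append(delta.get("content", ""))
    return "".join(ui)

# client(fake_server(["Hi", " there"])) -> "Hi there"
```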
Pros
- ✓ Dramatically improved perceived response speed (minimized time to first token, TTFT)
- ✓ No need to wait for full LLM generation
- ✓ Simple implementation as it is HTTP-based
- ✓ Can cancel mid-stream (AbortController)
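Mid-stream cancellation (the AbortController point above) has a simple Python analogue: stop consuming and close the stream, which lets the HTTP client tear down the connection. consume_until and token_stream are hypothetical names for this sketch:

```python
def consume_until(stream, stop_after: int) -> str:
    """Consume at most stop_after tokens, then cancel the stream."""
    out = []
    for token in stream:
        out.append(token)
        if len(out) >= stop_after:
            break  # user pressed "stop generating"
    # Explicitly close generator-backed streams so the underlying
    # connection is released (analogous to AbortController.abort()).
    if hasattr(stream, "close"):
        stream.close()
    return "".join(out)

def token_stream():
    # Stand-in for a live HTTP-backed token generator.
    for t in ["Hi", " th", "ere"]:
        yield t

partial = consume_until(token_stream(), 2)
# partial == "Hi th"; the remaining token "ere" is never produced
```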
Cons
- ✗ Token-by-token processing logic required
- ✗ Complex error handling (mid-stream disconnection)
- ✗ Total token count unknown in advance
- ✗ Client buffering management needed
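On the buffering point: network reads do not align with SSE event boundaries, so a single data: line can arrive split across reads and must be buffered until complete. A sketch, assuming newline-delimited data: lines (SSEBuffer is an illustrative name):

```python
class SSEBuffer:
    """Accumulate raw network chunks and yield only complete SSE lines."""

    def __init__(self):
        self._buf = ""

    def feed(self, chunk: str):
        """Add a chunk; return every complete line received so far."""
        self._buf += chunk
        # Everything before the last newline is complete; keep the tail.
        *lines, self._buf = self._buf.split("\n")
        return [l for l in lines if l]  # drop blank event separators

buf = SSEBuffer()
# A single SSE event may arrive split across two network reads:
first = buf.feed('data: {"delta":{"con')
second = buf.feed('tent":"Hi"}}\n')
# first == []  (no complete line yet)
# second == ['data: {"delta":{"content":"Hi"}}']
```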