AI token streaming isn't about SSE vs. WebSockets


At Ably, we’ve solved production token streaming, so you don’t have to. And the hard part isn’t SSE or WebSockets.

Ask an agentic coding tool or chatbot “how to stream AI tokens to a client in production” and it’ll give you a section of the answer on SSE vs WebSockets. But that’s not the question, or really the answer. In a pure comparison of SSE and WebSockets as the transport, SSE is the simpler choice, and also the better choice for most use cases.

The architecture you should build for production token streaming looks like the diagram below. It separates the ‘prompt’ request from the ‘response’ stream, and it has a token cache/data store that holds the tokens so that clients can resume and reconnect.

fig. 01: ChatGPT’s design

    flowchart LR
        H[human] -->|1. prompt request| S[POST /messages] --> L[llm]:::accent
        L -.->|sse tokens| S
        DB[(token cache)]
        S -.->|store tokens| DB
        H -.->|2. stream response| SS
        SS[GET /streams/:id]
        SS -.->|read tokens| DB
        classDef accent stroke:#7eb8a4,stroke-width:1.5px

The LLM response tokens are threaded through some datastore. The request and response are separated.

The WebSockets version looks almost exactly the same, except the client opens a WebSocket connection and sends the prompt as a message, and the server responds with token messages on the same connection. The token cache/data store is still needed for resume and reconnection.

fig. 02: WebSocket design

    flowchart LR
        H[human] -->|open websocket| S[Server] --> L[llm]:::accent
        L -.->|sse tokens| S
        DB[(token cache)]
        S -.->|store tokens| DB
        S -.->|read tokens| DB
        S -.->|stream tokens on websocket| H
        classDef accent stroke:#7eb8a4,stroke-width:1.5px

The same as the SSE design, but with a WebSocket connection.

Why this works with SSE

Most people’s system design is based around the idea that servers are stateless, and all the state is stored in a database. This lets the servers scale horizontally, because any server can handle any request. There’s generally a load balancer in front of the servers that routes requests to them. Mostly that load balancer spreads the load across servers rather than using any kind of sticky session or session affinity.

Turns out, SSE drops into this architecture really nicely. The client makes a POST request to the server with a prompt, and gets a stream ID back. The client then connects to that stream and gets the token streaming response. Any server can handle the original prompt request, and any server can handle the stream response, because tokens are threaded through the database.

The WebSockets design is basically the same, except the client opens a WebSocket connection and sends the prompt as a message on the connection. The response tokens are sent back on the same connection. The connection is longer-lived, and you have to build your own message/request framing layer for the shape of messages sent back and forth on the WebSocket. So WebSockets are more complex to build and maintain, and don’t really add any value in this architecture.

So it seems like an obvious choice to go with SSE for streaming your response tokens.

What other features do you need in a production deployment?

So far we’ve got a design for streaming tokens to the client in the ‘happy path’, where nothing goes wrong.
We’ve got the architecture to support the possible failure cases and how to recover from them, but we don’t have the features yet. So what else do you need to do to make those features possible?

Reconnection, resume, and recovery

The client needs to be able to reconnect to the stream if the connection drops, and resume from the last token that was sent to the client. This is where the token cache/data store comes in. The server needs to store the tokens so that a reconnecting client can pick up where it was before. But the client also needs to be able to indicate where it got to before the disconnect, and that needs to correlate with some position on the server.

    { "type": "text-delta", "delta": "hello", "position": 1 }
    { "type": "text-delta", "delta": " world", "position": 2 }
    { "type": "text-delta", "delta": "!", "position": 3 }
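As a rough sketch of the server side of this idea, the TypeScript below keeps tokens in an ordered log and replays everything after a given position. `TokenLog`, `append`, and `readAfter` are hypothetical names, and the in-memory array only stands in for the token cache; none of it is taken from a real implementation.

```typescript
// Hypothetical event shape, mirroring the example events above.
interface TokenEvent {
  type: "text-delta";
  delta: string;
  position: number; // 1-based, strictly increasing within one stream
}

// In-memory stand-in for the token cache. A real deployment would use a
// shared store so that any server can serve any stream.
class TokenLog {
  private events: TokenEvent[] = [];

  // Append a token and assign it the next position in the stream.
  append(delta: string): TokenEvent {
    const event: TokenEvent = {
      type: "text-delta",
      delta,
      position: this.events.length + 1,
    };
    this.events.push(event);
    return event;
  }

  // Return every event after the last position the client reports having
  // received. Passing 0 replays the whole stream.
  readAfter(lastPosition: number): TokenEvent[] {
    return this.events.filter((e) => e.position > lastPosition);
  }
}

// Usage: the client saw positions 1 and 2, then dropped and reconnected.
const log = new TokenLog();
for (const delta of ["hello", " world", "!"]) {
  log.append(delta);
}
const resumed = log.readAfter(2);
```

Keeping the position assignment inside the log means there is one place that decides ordering, which is what lets any server answer a resume request from the same data.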

All the tokens need a sequence ID in the stream, so that when the client reconnects it can say “the last token I got was at position 2”, and the server can then send tokens starting at position 3. The SSE spec has a Last-Event-ID header built into it for this purpose, but you still need to build the plumbing to support it on the server and client side.

Detecting dropped connections

The server and client need to be able to detect when the connection has dropped, so that the client can attempt to reconnect. To do this, the server has to send a heartbeat message every so often (e.g. every 10 seconds) to keep the connection alive and to...
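A minimal sketch of that plumbing on the server side, in TypeScript, might look like the following. It assumes the token position is used as the SSE `id:` field, which a browser's EventSource sends back in the Last-Event-ID header when it reconnects, and that heartbeats are sent as SSE comment lines (lines starting with `:`), which clients ignore. The function names are hypothetical, and the comment-line heartbeat is one option the SSE format allows rather than something this article specifies.

```typescript
// Hypothetical event shape, matching the positioned token events above.
interface TokenEvent {
  type: "text-delta";
  delta: string;
  position: number;
}

// Serialize one event as an SSE frame. JSON.stringify escapes newlines,
// so the payload always fits on a single `data:` line.
function toSseFrame(event: TokenEvent): string {
  return `id: ${event.position}\ndata: ${JSON.stringify(event)}\n\n`;
}

// An SSE comment frame. Assumption: sent on a timer (for example every
// 10 seconds) while no tokens are flowing.
const HEARTBEAT_FRAME = ": heartbeat\n\n";

// Parse the Last-Event-ID header. A missing or malformed value is treated
// as 0, meaning "replay from the start".
function parseLastEventId(header: string | undefined): number {
  if (header === undefined || !/^\d+$/.test(header)) {
    return 0;
  }
  return Number(header);
}

// Build the frames a reconnecting client still needs.
function framesForReconnect(
  stored: TokenEvent[],
  lastEventIdHeader: string | undefined,
): string[] {
  const last = parseLastEventId(lastEventIdHeader);
  return stored.filter((e) => e.position > last).map(toSseFrame);
}

// Usage: a client reconnects with `Last-Event-ID: 2`.
const stored: TokenEvent[] = [
  { type: "text-delta", delta: "hello", position: 1 },
  { type: "text-delta", delta: " world", position: 2 },
  { type: "text-delta", delta: "!", position: 3 },
];
const frames = framesForReconnect(stored, "2");
```

Treating a malformed header as 0 trades some duplicate tokens for never skipping any; a stricter server could reject the request instead.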
