AI token streaming isn't about SSE vs WebSockets — /dev/knill
At Ably, we’ve solved production token streaming, so you don’t have to. And the hard part isn’t SSE or WebSockets.

Ask an agentic coding tool or chatbot “how to stream AI tokens to a client in production” and it’ll give you a section of the answer on SSE vs WebSockets.

But that’s not the question, or really the answer.

In a pure comparison of SSE and WebSockets as the transport, SSE is the simpler choice, and is also the better choice for most use cases. The architecture you should build for production token streaming looks like the diagram below. It separates the ‘prompt’ request from the ‘response’ stream, and adds a token cache/data store that holds the tokens so that clients can resume and reconnect.

fig. 01 ChatGPT's design
flowchart LR
H[human] -->|1. prompt request| S[POST /messages] --> L[llm]:::accent
L -.->|sse tokens| S
DB[(token cache)]
S -.->|store tokens| DB
H -.->|2. stream response|SS
SS[GET /streams/:id]
SS -.->|read tokens| DB
classDef accent stroke:#7eb8a4,stroke-width:1.5px

The LLM response tokens are threaded through some datastore. The request and response are separated.

The WebSockets version looks almost exactly the same, except the client opens a WebSocket connection and sends the prompt as a message, and the server responds with token messages on the same connection. The token cache/data store is still needed for resume and reconnection.

fig. 02 WebSocket design
flowchart LR
H[human] -->|open websocket| S[Server] --> L[llm]:::accent
L -.->|sse tokens| S
DB[(token cache)]
S -.->|store tokens| DB
S -.->|read tokens| DB
S -.->|stream tokens on websocket| H
classDef accent stroke:#7eb8a4,stroke-width:1.5px

The same as the SSE design, but with a WebSocket connection.

Why this works with SSE

Most people's system design is based around the idea that servers are stateless, and all the state is stored in a database. This lets the servers scale horizontally to handle more requests, because any server can handle any request. There’s generally a load balancer in front of the servers that routes requests to them. That load balancer usually spreads load across servers rather than relying on any kind of sticky session or session affinity.

Turns out, SSE drops into this architecture really nicely. The client makes a POST request to the server with a prompt, and gets a stream ID back. The client then connects to that stream and gets the token streaming response.

Any server can handle the original prompt request, and any server can handle the stream response because tokens are threaded through the database.

WebSockets is basically exactly the same, except the client opens a WebSocket connection and sends the prompt as a message on the connection. The response tokens are sent back on the same connection. The connection is longer-lived, and you have to build your own message/request framing layer for the shape of messages sent back and forth on the WebSocket. So WebSockets are more complex to build and maintain, and don’t really add any value in this architecture.

So it seems like an obvious choice to go with SSE for streaming your response tokens.

What other features do you need in a production deployment?

So far we’ve got a design for streaming tokens to the client in the ‘happy path’, where nothing goes wrong.
We’ve got the architecture to support the possible failure cases and how to recover from them, but we don’t have the features yet. So what else do you need to do to make those features possible?

Reconnection, resume, and recovery

The client needs to be able to reconnect to the stream if the connection drops, and resume from the last token that was sent to the client. This is where the token cache/data store comes in. The server needs to store the tokens so that a reconnecting client can pick up where it left off. But the client also needs to be able to indicate where it got to before the disconnect, and that needs to correlate with some position on the server.

{ "type": "text-delta", "delta": "hello", "position": 1 }
{ "type": "text-delta", "delta": " world", "position": 2 }
{ "type": "text-delta", "delta": "!", "position": 3 }
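
The position-based resume described above can be sketched roughly as follows. This is a minimal illustration, not a production design: `TokenLog`, `append`, and `read_after` are hypothetical names, and the in-memory list stands in for whatever shared token cache/data store you actually use.

```python
from dataclasses import dataclass, field


@dataclass
class TokenLog:
    """Hypothetical in-memory stand-in for the token cache/data store."""

    deltas: list[str] = field(default_factory=list)

    def append(self, delta: str) -> dict:
        # Positions start at 1 and increase by one per stored token.
        self.deltas.append(delta)
        return {"type": "text-delta", "delta": delta, "position": len(self.deltas)}

    def read_after(self, last_position: int) -> list[dict]:
        # Return every event the client has not seen yet, in order.
        # Negative positions are treated as "nothing received so far".
        start = max(last_position, 0)
        return [
            {"type": "text-delta", "delta": delta, "position": position}
            for position, delta in enumerate(self.deltas[start:], start=start + 1)
        ]


log = TokenLog()
for delta in ["hello", " world", "!"]:
    log.append(delta)

# A client that received positions 1 and 2 before the drop resumes at 3.
resumed = log.read_after(2)
print(resumed)
```

Because the position is derived from the stored log rather than from any one connection, whichever server picks up the reconnect can answer it from the same store.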
Every token needs a sequence ID in the stream, so that when the client reconnects it can say “the last token I got was at position 2”, and the server can then send tokens starting at position 3.

The SSE spec has a Last-Event-ID header built in for this purpose, but you still need to build the plumbing to support it on the server and client side.

Detecting dropped connections

The server and client need to be able to detect when the connection has dropped, so that the client can attempt to reconnect. To do this, the server has to send a heartbeat message every so often (e.g. every 10 seconds) to keep the connection alive and to...