In previous blog posts I’ve mentioned my dislike for WebSockets. In this quick blog post I attempt to explain why. Unfortunately I think this raises more questions than it answers!
Be aware, these opinions are my own.
It’s the monitoring, stupid!
My dislike of WebSockets doesn’t stem from any of their implementation details, performance issues or general reinvention of the wheel (see this post for those critiques). It comes from a simple operational concern: How am I supposed to monitor WebSockets? What does it mean to say a WebSocket-based service is “working”?
With RPC-based services, we have a framework to think about how to monitor them – the RED method – where we measure the request rate (QPS), error rate and request duration. We set SLOs based on maximum tolerable latency (or some percentile of it), and the maximum tolerable error rate. If the service breaches these SLOs, we alert. We have an (albeit crude) way of deciding if an RPC-based service is working (is it alerting?) and, even better, we can do this with virtually no knowledge of what the service does, allowing our operations team (i.e. the developers) to “scale”.
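To make this concrete, here’s a minimal sketch of what RED-style instrumentation amounts to. The class and method names are my own inventions for illustration – in practice you’d use a real Prometheus client rather than hand-rolling this:

```python
# Sketch of RED-method instrumentation: Rate, Errors, Duration per endpoint.
# Illustrative only; a real service would export these via a Prometheus client.
from collections import defaultdict


class REDMetrics:
    """Tracks request count, backend error count and latency per endpoint."""

    def __init__(self):
        self.requests = defaultdict(int)    # endpoint -> total requests (Rate)
        self.errors = defaultdict(int)      # endpoint -> 5xx responses (Errors)
        self.durations = defaultdict(list)  # endpoint -> latencies in seconds (Duration)

    def observe(self, endpoint, status, duration):
        self.requests[endpoint] += 1
        if status >= 500:  # only backend errors count against the SLO
            self.errors[endpoint] += 1
        self.durations[endpoint].append(duration)

    def error_rate(self, endpoint):
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0


metrics = REDMetrics()
metrics.observe("/api/users", 200, 0.012)
metrics.observe("/api/users", 500, 0.250)
```

The key property is that every request is short-lived, so each one yields a (status, duration) pair – exactly the data the alerting rules need.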
The same cannot be said of WebSockets. As long-lived connections, measuring the HTTP-level request duration is meaningless. And typically the QPS of WebSocket-based services is low when compared to RPC-based services – O(number of active users / session length) vs O(number of active users). Monitoring becomes at best service-specific, and at worst completely absent.
How am I supposed to build and operate reliable services if I can’t monitor them?
Why do I hate WebSockets so much?
This one is easy – numerous production incidents on Weave Cloud have stemmed from our inability to monitor whether WebSocket-based services are “working”. We’ve promoted bad releases to production. We’ve introduced broken code into our frontends which then failed to proxy WebSocket connections correctly. And whilst WebSocket “requests” do have HTTP status codes, in our experience they have generally not been useful – the common failure mode is for them to error in the 400 range, something we don’t consider to be a backend error and therefore don’t alert on.
Were these issues resolved? Sure. Should we have been more careful when promoting builds to prod? Perhaps, but vigilance doesn’t scale.
In my opinion the use of WebSockets can be generalised to represent an inversion in the direction of “control” vs the direction of connections. For example, you want to do RPCs from service A to service B, but service B is running somewhere service A cannot connect to. So you have service B proactively connect a WebSocket to service A, and then layer an RPC system on top of this connection. This could be a chat service “pushing” messages to an app running in your browser, or the Scope App sending “control” RPCs to the Probe, or Weave Cloud sending RPCs to Fluxd.
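To make the inversion concrete, here is a hypothetical simulation of the pattern, with a pair of in-memory queues standing in for the WebSocket. Service B dials out, and service A then issues RPCs back over the connection B opened – note that A is the caller even though B initiated the connection:

```python
# Simulated connection inversion: the "server" (A) issues RPCs over a
# channel the "client" (B) initiated. Queues stand in for the WebSocket.
import queue
import threading

to_b = queue.Queue()  # A -> B: RPC requests travel "down" the connection
to_a = queue.Queue()  # B -> A: responses travel back up

def service_b():
    # B dialled out; now it serves RPCs arriving over its own connection.
    while True:
        method, arg = to_b.get()
        if method == "shutdown":
            break
        to_a.put(("ok", arg.upper()))  # toy RPC: upper-case the argument

worker = threading.Thread(target=service_b)
worker.start()

# Service A makes an RPC to B, despite never having connected to B.
to_b.put(("echo", "hello"))
status, result = to_a.get()

to_b.put(("shutdown", None))
worker.join()
```

The operational problem falls straight out of this shape: from A’s side there is one long-lived connection per B, not a stream of short requests, so there is nothing natural for RED-style monitoring to measure.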
Don’t use WebSockets
So what should I use? Unfortunately I don’t have a great answer here – perhaps you do?
On a case-by-case basis I think you could do a few different things:
- Just don’t invert connection flow. Find other ways. For instance, why can’t the Flux service in Weave Cloud connect directly to the customers’ Kubernetes API servers? I would wager users who are willing to use a hosted CD service probably have a publicly accessible API server.
- Use Comet-style long polling – but don’t long poll. “Normal” poll, tolerate the latency, and use a reasonable scheme for backing off when there is less activity. This is particularly suitable for broadcast-style messages where you don’t care about the response.
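The second option can be sketched as follows. `next_interval` is a hypothetical helper capturing the backoff policy (poll quickly while there is activity, slow down exponentially when idle); the names and defaults are my own, not from any library:

```python
# Plain polling with exponential backoff – each poll is an ordinary,
# short-lived HTTP request, so the RED method applies unchanged.
import time


def next_interval(current, had_activity,
                  min_interval=1.0, max_interval=60.0, factor=2.0):
    """Reset to fast polling on activity; otherwise back off up to a ceiling."""
    if had_activity:
        return min_interval
    return min(current * factor, max_interval)


def poll(fetch, handle):
    """Polling loop: fetch() performs one normal HTTP request and returns
    any pending messages; handle() processes each one."""
    interval = 1.0
    while True:
        messages = fetch()
        for m in messages:
            handle(m)
        interval = next_interval(interval, had_activity=bool(messages))
        time.sleep(interval)
```

Because every iteration is an independent request with its own status code and duration, the monitoring story collapses back to the one we already have for RPCs.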
Potentially HTTP/2 might provide a higher level of abstraction such that we can build a common monitoring language for server initiated RPCs, but it’s early days.
Thank you for reading our blog. We build Weave Cloud, which is a hosted add-on to your clusters. It helps you iterate faster on microservices with continuous delivery, visualization & debugging, and Prometheus monitoring to improve observability.