Weave Cloud Outage 28th Feb Post-Mortem
Weave Cloud is our service for deploying, exploring, and monitoring microservice-based applications. At 17:37 GMT on 28th Feb, Weave Cloud suffered an outage for over 4 hours. The expectation of availability on a service like Weave Cloud is high, and we fully intend to learn from this experience and improve our service.
In this blog post we’ll discuss what caused the outage, how we handled the incident, and what we’re going to do to make sure this kind of outage cannot happen again. We’ll also attempt to explain some of the decisions that led to the current situation.
The overall Weave Cloud outage lasted 4hrs 23mins. For the first 1hr 8mins the Weave Cloud UI could not be loaded. For the entire outage, Scope reports sent to Weave Cloud were lost. No Cortex data was lost, but Cortex was unavailable for queries for ~3hrs 20mins, as S3 GETs recovered sooner than S3 PUTs.
The outage did not affect Flux, although as Docker Hub and Quay.io were also affected, Flux did not deploy any images during this time.
The root cause of the Weave Cloud outage was the total outage of AWS S3 in the us-east-1 region, where Weave Cloud is located. Amazon have published a brief post-mortem of the S3 incident:
At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
We are also making our internal post-mortem document publicly available for comment.
The following graphs show our QPS and request latency from Cortex to S3 during the incident:
Weave Cloud UI
When we launched Weave Cloud, we served our UI assets from a replicated Nginx service hosted on our Kubernetes cluster. These assets were baked into the container image by our CI pipeline, so we could do rolling upgrades and rollbacks of the UI by doing rolling updates and rollbacks on the Nginx service, just as we do with all our other backend services.
As a result of this design, rolling upgrades of our Nginx service could result in page load errors. Our UI assets are all named after their content hash for cache busting. If, during a rolling upgrade, an index.html was loaded from one version of the Nginx containers while requests for the assets were sent to others, the assets could fail to load. To solve this issue, we started uploading the assets to S3 as part of our CI pipeline and referencing that bucket from our index.html, while continuing to serve the index.html itself from Nginx. This gave us a system for doing rolling upgrades and rollbacks without any potential for page load errors.
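The content-hash naming scheme at the heart of this setup can be sketched as follows. This is an illustrative helper, not our actual CI code, and the function name is hypothetical:

```python
import hashlib
import os
import shutil

def fingerprint_asset(path: str, out_dir: str) -> str:
    """Copy an asset into out_dir under a name derived from its content hash.
    A changed file gets a new URL, so browser caches are busted automatically,
    and index.html can reference the hashed name (e.g. in an S3 bucket)."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()[:12]
    base, ext = os.path.splitext(os.path.basename(path))
    hashed_name = f"{base}.{digest}{ext}"
    shutil.copy(path, os.path.join(out_dir, hashed_name))
    return hashed_name
```

Because the hashed name changes only when the content changes, old and new asset versions can coexist in the bucket during a rolling upgrade, and whichever index.html a client receives always finds matching assets.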
When we became aware of the S3 outage, our first discovery was that no one could load https://cloud.weave.works. We were able to revert to the old system: rebuild a container image with all the assets embedded, then deploy it to the cluster, which restored service for the Weave Cloud UI. This process itself presented multiple challenges, as CircleCI (our CI service) was affected by the S3 outage, as were Docker Hub (where the Nginx base image is hosted) and Quay.io (where we host our private images). Luckily for us, one of our engineers had locally cached versions of the required images, and was able to jury-rig a script to manually push the new Nginx image to each minion in our Kubernetes cluster.
Luck is not a strategy, so we are going to script and test the manual-push step, and automate a process for ensuring the on-call engineer has all the required images available locally to build any of our services. This should be relatively straightforward to do, as the dependencies are all declared in our Dockerfiles.
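The manual-push step amounts to streaming each locally cached image over SSH to every node, bypassing the unavailable registries. A minimal sketch of the idea, which just builds the shell pipelines rather than executing them; the host names are hypothetical and would in practice come from something like `kubectl get nodes`:

```python
from typing import List

def push_commands(image: str, minions: List[str]) -> List[str]:
    """Build one `docker save | ssh ... docker load` pipeline per minion,
    copying a locally cached image onto each node without a registry."""
    return [
        f"docker save {image} | ssh {host} 'docker load'"
        for host in minions
    ]

# Example: push a rebuilt UI image to two nodes (names illustrative).
for cmd in push_commands("ui-server:recovery", ["minion-1", "minion-2"]):
    print(cmd)
```

With the image loaded directly into each node's Docker daemon, the Kubernetes pods can be restarted against it even while Docker Hub and Quay.io are down.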
The hosted version of Weave Scope that we run as part of Weave Cloud relies on S3 to store each report that is sent to us by a user. The original design doc goes into more detail. Scope also uses the same trick as the Cloud UI to serve its static assets from an S3 bucket, but because Scope is also available as a standalone open source project, a flag exists to change this behaviour so that Scope serves the assets itself. This flag change was quickly rolled out.
Scope uses a Memcache cluster in front of S3 to store and accelerate report fetches, in such a way that all reports ever needed to serve the UI are kept in Memcache. Unfortunately, reports are only written to this Memcache once they have been successfully written to S3. We considered disabling writes to S3 in the code completely, but given the challenges we had building and pushing code, we deemed the change too risky. We plan to introduce “degraded” mode reads and writes into Scope in the near future, to cope with temporary S3 outages. Follow weaveworks/scope#2297 for progress on this.
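The write-path coupling described above, and the proposed degraded mode, can be sketched as follows. This is a simplified model, not Scope's actual (Go) code, and the store objects are stand-ins for real S3 and Memcache clients:

```python
class ReportStore:
    """Sketch of Scope's report write path. Today a report reaches the cache
    only after a successful S3 write, which is why the cache could not mask
    the outage; a hypothetical degraded mode would cache the report anyway."""

    def __init__(self, s3, memcache, degraded: bool = False):
        self.s3 = s3
        self.memcache = memcache
        self.degraded = degraded

    def put_report(self, key: str, report: bytes) -> bool:
        try:
            self.s3.put(key, report)
        except IOError:
            if not self.degraded:
                return False  # current behaviour: the report is lost
            # degraded mode: keep serving the UI from cache despite S3 being down
        self.memcache.set(key, report)
        return True
```

In degraded mode reports would survive in cache (and so in the UI) for the duration of a short S3 outage, at the cost of durability if the cache itself were lost before S3 recovered.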
Weave Cortex, our hosted Prometheus as a Service, relies on S3 to store compressed, encoded timeseries data (“chunks”). The design of Weave Cortex batches samples in memory for a period of a few hours before flushing to S3. As we couldn’t flush to S3, pending chunks backed up in memory during this outage, but no data was lost. Due to Prometheus’ incredibly efficient compression scheme, this didn’t even use that much memory:
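The buffering behaviour that made this safe can be sketched as a flush queue that simply stops draining while S3 is down. Again a simplified model, not Cortex's actual (Go) implementation; `s3` is a stand-in client:

```python
import collections

class ChunkFlusher:
    """Sketch of Cortex's flush behaviour during the outage: chunks that
    cannot be written to S3 stay queued in memory and are retried on the
    next flush cycle, so samples back up but are never dropped."""

    def __init__(self, s3):
        self.s3 = s3
        self.pending = collections.deque()  # chunks awaiting a flush

    def enqueue(self, chunk: bytes) -> None:
        self.pending.append(chunk)

    def flush(self) -> int:
        """Flush pending chunks in order; stop at the first failure and
        keep the remainder in memory. Returns the number flushed."""
        flushed = 0
        while self.pending:
            try:
                self.s3.put(self.pending[0])
            except IOError:
                break  # S3 unavailable: chunks continue to back up in memory
            self.pending.popleft()
            flushed += 1
        return flushed
```

A chunk is only removed from the queue after S3 has acknowledged the write, which is exactly why the outage cost memory rather than data.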
On the read side, Cortex’s memcaches are not quite as effective as Scope’s (yet). This resulted in the odd chunk read being required from S3 for each query, which caused those queries to fail. As such, Cortex was unavailable for queries until around 21:10. We intend to introduce a similar “degraded” mode read into Cortex in the near future to work around this; see weaveworks/cortex#309.
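The query-side failure mode, and what a degraded read might look like, can be sketched as follows. This is an illustrative model under the same stand-in clients as above, not the actual Cortex code:

```python
def query_chunks(keys, memcache, s3, degraded=False):
    """Sketch of a Cortex chunk fetch. Normally a single S3 miss fails the
    whole query; in a hypothetical degraded mode, missing chunks are skipped
    and a partial result is returned instead."""
    chunks = []
    for key in keys:
        chunk = memcache.get(key)
        if chunk is None:  # cache miss: fall back to S3
            try:
                chunk = s3.get(key)
            except IOError:
                if degraded:
                    continue  # skip the missing chunk, serve partial data
                raise         # current behaviour: the query fails outright
        chunks.append(chunk)
    return chunks
```

The trade-off is that a degraded query silently returns gaps in the timeseries, so the mode would need to be surfaced to users rather than enabled silently.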
The ideas outlined above will allow us to tolerate future outages in S3 in a single region by falling back to a degraded mode of operation. Clearly, the longer term fix is to move towards operating Weave Cloud out of multiple independent regions, which should allow us to maintain full operation even if a single region fails. This is a major undertaking and not something we are going to jump straight into, but design and planning for this has begun. We are even considering having the second region be on a different service provider, such as Google Cloud Platform.