At Weaveworks, we believe that being on-call is not primarily about operations, but rather about user experience.

When something goes wrong in production, the first question we ask is “What’s the user impact?”

When choosing what to alert on, we don’t look for things like CPU usage being too high, or some queue being too long. Instead, we alert on user-visible symptoms. For a web service, this typically means a high ratio of errors to total requests, or an unacceptably high long-tail latency. Here, we are strongly influenced by the ideas in Google’s SRE book, particularly the chapter on Monitoring Distributed Systems.

Why does this matter? I think it’s helpful to flip the question around. If one of your containers is using 100% of a CPU core on one of the nodes in your cluster, why do you care? It might point to a potential optimization, or a genuine user-facing problem further down the line, but is it important enough to interrupt whatever it is you’re currently doing in order to investigate? I’d say not.

If your alerts are always based on user-visible symptoms, then you know that when you get an alert, it is worthy of your attention.

We think this is such a good idea that we decided to make it official by encoding it into our own internal Prometheus configuration.

Here’s how we did it

Prometheus allows for alerts to have “annotations”, which can be used to store extra information about an alert.

We went through all of our alerts and made sure each one had an impact annotation.

For example, this alert catches errors in the Weave Cloud Explore service:

ALERT QueryErrorRate
  IF          job:scope_request_errors:rate1m{job="scope/query"} > 0.1
  FOR         5m
  LABELS      { severity="critical" }
  ANNOTATIONS {
    summary = "scope/query: high error rate",
    impact = "Users experiencing bugs in Weave Cloud Explore",
    description = "The query service has an error rate (response code >= 500) of {{ $value }} errors per second.",
    dashboardURL = "https://$REDACTED_INTERNAL_URL/grafana/dashboard/file/scope-services.json",
    playbookURL = "https://$REDACTED_INTERNAL_URL/PLAYBOOK.md#query",
  }

Notice the impact says:

  • who is affected (in this case, “users”)
  • what they are experiencing (“bugs”)
  • the scope of the impact (“in Weave Cloud Explore”)

This gives the on-call engineer enough information to make a judgment on how to react. If many things go wrong at once, the on-call engineer can (and does!) use the impact to decide which to deal with first.
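
The rule above is written in the Prometheus 1.x rule syntax. Prometheus 2.0 and later expect alerting rules in YAML, so as a rough sketch, the same alert might look something like this. The group name is made up for illustration, and `{{ $value }}` is assumed to be the template that fills in the measured error rate.

```yaml
groups:
  - name: scope-query-alerts  # hypothetical group name, not from our config
    rules:
      - alert: QueryErrorRate
        # Same recording rule and threshold as the 1.x rule above.
        expr: job:scope_request_errors:rate1m{job="scope/query"} > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "scope/query: high error rate"
          # The impact annotation: who is affected, what they see, and where.
          impact: "Users experiencing bugs in Weave Cloud Explore"
          # Assumption: {{ $value }} supplies the current error rate.
          description: "The query service has an error rate (response code >= 500) of {{ $value }} errors per second."
          dashboardURL: "https://$REDACTED_INTERNAL_URL/grafana/dashboard/file/scope-services.json"
          playbookURL: "https://$REDACTED_INTERNAL_URL/PLAYBOOK.md#query"
```

Annotations are free-form key/value pairs, so `impact` needs no special support from Prometheus. It only becomes useful once something downstream displays it.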

Other impact annotations we have in our alerts include:

  • “No one can log in”
  • “Terminals and other controls are failing for Weave Cloud users”
  • “We are running an obsolete / inappropriate version of software. Probably no user impact, but you’ll want to check that.”
  • “Cluster nodes more vulnerable to security exploits. Eventually, no disk space left.”
  • “We cannot reliably respond to operational issues.”

This representative sample shows that while we would like to only alert on user-visible symptoms, there are occasionally operational concerns that we have to deal with.

The last impact statement comes up when what’s in our Git repo doesn’t match what’s in production, and is part of how we do GitOps (operations by pull request). It is not user-facing, but it is something that the on-call engineer has to deal with.

So how does the on-call engineer actually see this impact?

Although serious alerts go to OpsGenie, we tend to see them first in Slack. Here’s what they look like:

[Image: RebootRequired alert from the Weave Cloud internal Slack]

Getting Prometheus and Slack to work together to produce these messages requires a little bit of fiddling. Here’s the relevant clause from our Alertmanager configuration.

receivers:
- name: 'service-alerts-warning'
  slack_configs:
  - api_url: https://hooks.slack.com/services/$REDACTED_WEBHOOK_URL
    channel: cloud
    send_resolved: true
    username: prod-alert
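    # Show the impact annotation, or a reminder when an alert lacks one.
    text: |
      *Impact*: {{ if .CommonAnnotations.impact }} {{ .CommonAnnotations.impact }} {{ else }} No impact defined. Please add one or disable this alert. {{ end }}
      {{ if .CommonAnnotations.description }}*Details*: {{ .CommonAnnotations.description }} {{ end }}
      <{{ .CommonAnnotations.playbookURL }}|Playbook> {{ if .CommonAnnotations.dashboardURL }} <{{ .CommonAnnotations.dashboardURL }}|Dashboard> {{ end }}
    title: |
      {{ .CommonAnnotations.summary }} ({{ .Status }})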

No doubt if we had more time we could use Slack’s advanced formatting features to make this even nicer.

The great thing is that with all of this in place, we barely even have to ask “What’s the user impact?” Our tools tell us automatically.

