The RED Method: key metrics for microservices architecture
Tom Wilkie shares Weaveworks’ monitoring philosophy and the three most important metrics to use in your microservices architecture.
In previous blog posts and talks I’ve alluded to The RED Method, our monitoring philosophy at Weaveworks. In this blog I’ll outline what The RED Method is, why we use it, and where it comes from.
What is The RED Method?
The RED Method defines the three key metrics you should measure for every microservice in your architecture. Those metrics are:
- (Request) Rate - the number of requests, per second, your services are serving.
- (Request) Errors - the number of failed requests per second.
- (Request) Duration - distributions of the amount of time each request takes.
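To make the three definitions concrete, here is a minimal sketch in plain Python of how they could be computed from a batch of request records collected over one time window. The `Request` record, the `red_summary` function and the choice of the 50th and 99th percentiles are assumptions made for illustration; in practice a tool like Prometheus does this aggregation for you.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    """One served request (hypothetical record shape, for illustration only)."""
    duration_s: float  # how long the request took, in seconds
    failed: bool       # True if the request ended in an error


def percentile(sorted_values: List[float], q: float) -> Optional[float]:
    """Nearest-rank percentile of an already-sorted list, or None if it is empty."""
    if not sorted_values:
        return None
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]


def red_summary(requests: List[Request], window_s: float) -> dict:
    """Summarise Rate, Errors and Duration for the requests seen in one window."""
    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if r.failed)
    return {
        "rate_per_s": len(requests) / window_s,   # (Request) Rate
        "errors_per_s": errors / window_s,        # (Request) Errors
        # (Request) Duration is a distribution, so report percentiles, not a mean.
        "duration_p50_s": percentile(durations, 0.50),
        "duration_p99_s": percentile(durations, 0.99),
    }
```

Note that Duration is summarised with percentiles rather than a single average, since an average hides the slow tail that users actually notice.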
Measuring these metrics is pretty straightforward, especially when using tools like Prometheus (or Weave Cloud’s hosted Prometheus service). I’ve already written a blog post on how we instrument our services in Weave Cloud, so I’m not going to cover it here.
Another nice aspect of the RED method is that it helps you think about how to build your dashboards. You should bring these three metrics front and center for each service, and error rate should be expressed as a proportion of request rate. At Weaveworks, we settled on a pretty standard format for our dashboards - two columns, one row per service, request & error rate on the left, latency on the right.
We’ve even built a Python library to help us generate these dashboards: GrafanaLib.
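The layout described above can be sketched as data. The example below is not GrafanaLib's API; it just builds plain Python dicts holding PromQL query strings, one row per service. The metric name `request_duration_seconds` and the labels `job` and `status_code` are assumptions for illustration, so substitute whatever your services actually export.

```python
def red_row(service: str, metric: str = "request_duration_seconds") -> dict:
    """Describe one dashboard row: request & error rate on the left, latency on the right.

    Metric and label names are hypothetical; adjust them to your instrumentation.
    """
    all_requests = f'{{job="{service}"}}'
    failed_requests = f'{{job="{service}",status_code=~"5.."}}'

    rate = f"sum(rate({metric}_count{all_requests}[1m]))"
    error_rate = f"sum(rate({metric}_count{failed_requests}[1m]))"
    p99 = (
        f"histogram_quantile(0.99, "
        f"sum(rate({metric}_bucket{all_requests}[1m])) by (le))"
    )

    return {
        "service": service,
        "left": {
            "title": f"{service} request & error rate",
            "queries": [rate, error_rate],
            # Errors expressed as a proportion of request rate.
            "error_ratio": f"{error_rate} / {rate}",
        },
        "right": {
            "title": f"{service} latency",
            "queries": [p99],
        },
    }


def red_dashboard(services: list) -> list:
    """One row per service, in the order given."""
    return [red_row(s) for s in services]
```

Because every service gets the same row, adding a service to the dashboard is a one-line change to the list passed to `red_dashboard`.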
Why The RED Method?
Why should you measure the same metrics for every service? Surely each service is special? The benefit of treating each service the same, from a monitoring perspective, is scalability in your operations team, which, if you are like Weaveworks, means my fellow developers and me.
What does scalability of an operations team mean? I look at this from the point of view of how many services a given team can support. In an ideal world, the number of services the team can support would be independent of its size and dependent only on other factors - what kind of response SLA you want, whether you want 24/7 coverage, etc. So how do you decouple the number of services you can support from the team size? By making every service look, feel and taste the same. This reduces the amount of service-specific training the team needs, and reduces the number of service-specific special cases the on-call engineers need to memorize for those high-pressure incident response scenarios - what has been referred to as “cognitive load.”
As an aside, if you treat all your services the same, many repetitive tasks become automatable. Capacity planning? Do it as a function of QPS and latency. Dashboards and alerts with links to playbook entries and those dashboards? Automatically generate them.
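As one way to read "capacity planning as a function of QPS and latency", here is a minimal sketch based on Little's law (average requests in flight = arrival rate × average latency). The function name, the per-replica concurrency figure and the headroom default are all assumptions for illustration, not something we prescribe.

```python
import math


def replicas_needed(
    peak_qps: float,
    mean_latency_s: float,
    concurrency_per_replica: float,
    headroom: float = 0.25,
) -> int:
    """Estimate how many replicas a request-driven service needs.

    Little's law gives the average number of in-flight requests as
    QPS * latency. Each replica is assumed (hypothetically) to handle
    `concurrency_per_replica` requests at once, of which a `headroom`
    fraction is kept spare for spikes.
    """
    if concurrency_per_replica <= 0:
        raise ValueError("concurrency_per_replica must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")

    in_flight = peak_qps * mean_latency_s
    usable_per_replica = concurrency_per_replica * (1 - headroom)
    # Always run at least one replica, even with no traffic.
    return max(1, math.ceil(in_flight / usable_per_replica))
```

For example, a service peaking at 400 QPS with 250 ms mean latency has about 100 requests in flight; at 16 concurrent requests per replica and 25% headroom, `replicas_needed(400, 0.25, 16)` suggests 9 replicas. Since the inputs are just the RED metrics, the same calculation applies unchanged to every service.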
Where does The RED Method come from?
I can’t take any credit for this philosophy, as it is 100% based on what I learned as a Google SRE. Google calls it “The Four Golden Signals”. The Google SRE book is a great read, and goes into way more depth than I can.
Google includes an extra metric, Saturation, over and above the RED method. I don’t include Saturation because, in my opinion, it is a more advanced use case. I think the first three metrics are really the most important, and people remember things in threes… But if you’ve mastered the first three, by all means include Saturation.
The name “The RED Method” started life as a tongue-in-cheek play on Brendan Gregg’s USE Method - another recommended read. The USE Method was being circulated around the office when we started Weave Cloud and were discussing how to monitor it. I felt like I needed a catchy name for the monitoring strategy I was proposing, so The RED Method was born. The USE Method is a fantastic way to think about how to monitor resources; we use it as a framework for monitoring the infrastructure behind Weave Cloud. However, I think the abstraction becomes a little strained when talking about services.
It is fair to say this method only works for request-driven services - it breaks down for batch-oriented or streaming services, for instance. It is also not all-encompassing. There are times you will need to monitor other things - the USE Method is a great example when applied to resources like host CPU & memory, or caches.