One of the most important decisions when setting up Prometheus monitoring is which information about your app matters most to collect. The metrics you choose simplify troubleshooting and root cause analysis when a problem does occur.
At Weave, we defined a system called the RED method that you may find useful. The RED method, which loosely follows the principles outlined in the Four Golden Signals, focuses on measuring things that end users care about when using your web services, like service performance or errors appearing in the web interface.
With the RED method, you instrument three key metrics for every microservice in your architecture:
- (Request) Rate - the number of requests, per second, your services are serving.
- (Request) Errors - the number of failed requests per second.
- (Request) Duration - the amount of time each request takes, expressed as a time interval.
Rate, Errors and Duration therefore attempt to cover the most obvious web service issues. Together they also yield an error rate expressed as a proportion of the request rate. Using these basic metrics, you can capture most problems that an end user might face with your web app, such as how many errors there are and how slow the service is.
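As a rough illustration of the three metrics and the derived error proportion, here is a minimal Python sketch that summarizes a batch of already-recorded requests. The `Request` record, its field names, and the `red_summary` helper are hypothetical and exist only for this example; in a real setup Prometheus would compute these values from instrumented counters and histograms rather than from an in-memory list.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """One served request (hypothetical record, for illustration only)."""
    duration_s: float  # how long the request took, in seconds
    failed: bool       # True if the request ended in an error


def red_summary(requests, window_s):
    """Summarize Rate, Errors and Duration over a window of `window_s` seconds."""
    total = len(requests)
    failures = sum(1 for r in requests if r.failed)
    return {
        # Rate: requests served per second
        "rate_per_s": total / window_s,
        # Errors: failed requests per second
        "errors_per_s": failures / window_s,
        # Error rate as a proportion of the request rate
        "error_ratio": failures / total if total else 0.0,
        # Duration: a single summary value; real dashboards usually show percentiles
        "median_duration_s": median(r.duration_s for r in requests) if total else None,
    }


# Four requests observed over a two-second window, one of which failed.
sample = [
    Request(0.1, False),
    Request(0.2, False),
    Request(0.3, True),
    Request(0.4, False),
]
summary = red_summary(sample, window_s=2.0)
print(summary)
```

The median is used here only to keep the sketch short. Duration is a distribution, so a single number hides slow outliers that users will notice.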
For even more detailed coverage, you may also include the Saturation metric, which refers to ‘the degree to which the resource has extra work that it can’t service’ (see Brendan Gregg’s USE Method), but this is over and above the RED method.
Saturation is a more advanced use case and is not entirely necessary when monitoring web apps from a user’s perspective, but if you’ve mastered these first three metrics, then you can also include it to give you an even broader spectrum of metrics to draw from.
The USE method focuses more on monitoring resource performance and is meant to be used as a starting point for determining the root cause of performance issues and other systemic bottlenecks.
Ideally, both the USE and the RED methods should be used together when monitoring your applications.
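To make the contrast with RED concrete, the following sketch shows one way the USE signals (Utilization, Saturation, Errors) might be derived for a single resource. The `ResourceSample` record and the `use_summary` helper are hypothetical, and treating saturation as queue length is a simplifying assumption; how each signal is best measured depends on the resource.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """Snapshot of one resource, such as a worker pool (hypothetical fields)."""
    capacity: int  # units of work the resource can service at once
    busy: int      # units currently being serviced
    queued: int    # units waiting because the resource is full
    errors: int    # error events observed during the sample window


def use_summary(sample):
    """Derive Utilization, Saturation and Errors from one snapshot."""
    return {
        # Utilization: fraction of capacity currently in use
        "utilization": sample.busy / sample.capacity,
        # Saturation: extra work the resource cannot service yet (assumed: queue length)
        "saturation": sample.queued,
        # Errors: count of error events
        "errors": sample.errors,
    }


pool = ResourceSample(capacity=8, busy=6, queued=3, errors=0)
usage = use_summary(pool)
print(usage)
```

RED describes what callers of a service experience. A summary like this describes the state of a resource behind the service, which is why the two views complement each other.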
One advantage of the RED method is that it helps you to think about how to display information in your dashboards. With just these three metrics, you can standardize the layout of your Grafana dashboards so that they are easier to read when there is a problem. For example, one possible layout uses two columns with one row per service: request and error rate on the left, latency on the right.
A Python library is also available to help you generate the dashboards: GrafanaLib.
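The two-column, one-row-per-service layout can also be sketched as plain data. The example below deliberately does not use GrafanaLib's API; the `red_dashboard_layout` function and the dictionary keys are hypothetical and only show how a standard layout can be generated from a list of service names.

```python
def red_dashboard_layout(services):
    """Return one row per service, each with two side-by-side panels."""
    rows = []
    for name in services:
        rows.append({
            "service": name,
            # Left column: request rate and error rate share one panel
            "left": {"title": f"{name}: request & error rate", "unit": "requests/s"},
            # Right column: request latency (duration)
            "right": {"title": f"{name}: latency", "unit": "seconds"},
        })
    return rows


# Service names are placeholders for this example.
layout = red_dashboard_layout(["frontend", "orders"])
for row in layout:
    print(row["left"]["title"], "|", row["right"]["title"])
```

Because every service gets the same row shape, adding a service to the dashboard means adding one name to the list.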
For more information about integrating the Grafana Dashboard, see Integrating Grafana.
It is fair to characterize the RED method as one that only works for request-driven services; it breaks down for batch-oriented or streaming services, for instance. It is also not all-encompassing. There are times when you need to monitor other things: the USE Method, for example, is better applied to resources like host CPU and memory, or caches.
Also, an app developer may have more specific things to monitor that relate to the development of a microservice, and this method doesn’t address those specific needs.