In this post we discuss how to configure Prometheus to monitor your Kubernetes applications and services. We also share some best practices for instrumenting your applications consistently, based on our experience using Prometheus to monitor the service behind Weave Cloud. The results have been a lower cognitive load, less alert fatigue and a happier on-call team.

This is the third post in our series on Prometheus and Kubernetes – see “A Perfect Match” and “Deploying” for the previous instalments.

TLDR:

  • We model Kubernetes pods and services as Prometheus instances and jobs using a combination of specific service discovery configuration and relabelling rules.
  • We made Prometheus monitoring the default for our cluster, and added various relabelling rules and service annotations to support exporters and monitoring opt-out.
  • We made the Prometheus job name contain the Kubernetes service namespace, to prevent accidental aggregation across jobs in different namespaces.
  • We use the RED method to consistently instrument our services and build dashboards and alerts.

Mapping the Kubernetes Model into Prometheus

Firstly, we need to decide how to map the Kubernetes object model (Containers, Pods, Services, Nodes etc) into Prometheus. The Prometheus object model is very flexible and abstract; it consists of a set of time series, each identified by a unique set of label-value pairs, with an associated series of timestamped values. Some of these labels are more equal than others – the job label and the instance label being my favourites. By setting these labels, we arrange for Prometheus’ targets page to accurately reflect the topology of our application, which lets us see any immediate problems at a glance. We chose to model Kubernetes services as Jobs in Prometheus, and individual Pods as Instances.

The Prometheus Kubernetes service discovery module allows you to discover several kinds of objects in the Kubernetes model. You specify the kind of object in the Prometheus config as the ‘role’. Each object of that kind will be discovered as a different scrape target, on which you then use relabelling rules to configure how to map your application into Prometheus:

kubernetes_sd_configs:
- api_servers:
  - 'https://kubernetes.default.svc.cluster.local'
  in_cluster: true
  role: endpoint

At Weaveworks we found the most useful role to be endpoints, as these allow us to group together the Pods which make up a particular service into the same Prometheus job. To achieve this we use the following relabelling rules:

relabel_configs:
- source_labels: [__meta_kubernetes_service_name]
  action: replace
  target_label: job

Since we are completely invested in Prometheus as our monitoring system, we also chose to differ from the default configuration by making scraping opt-out instead of opt-in:

relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
  action: drop
  regex: false
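
To make the opt-out behaviour concrete, here is a minimal Go sketch of how a drop rule like the one above decides whether a target is kept. It is an illustration rather than Prometheus’ own code; keepTarget is a hypothetical name, and the sketch relies on the fact that Prometheus anchors relabelling regexes at both ends:

```go
package main

import (
	"fmt"
	"regexp"
)

// optOut mirrors "regex: false" from the rule above. Relabelling regexes are
// fully anchored, so only the exact value "false" matches.
var optOut = regexp.MustCompile(`^(?:false)$`)

// keepTarget (a hypothetical helper) reports whether a target survives the
// drop rule, given the value of its scrape annotation. A service without the
// annotation yields the empty string, which does not match, so it is scraped.
func keepTarget(scrapeAnnotation string) bool {
	return !optOut.MatchString(scrapeAnnotation)
}

func main() {
	for _, v := range []string{"", "true", "false"} {
		fmt.Printf("annotation %q -> scraped: %v\n", v, keepTarget(v))
	}
}
```

Only services that explicitly set the annotation to false are dropped; everything else is scraped by default.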

We also use Kubernetes namespaces extensively to group together sets of microservices into bigger services. Initially we modelled these as separate label-value pairs in Prometheus, but we found it was too easy for dashboards to accidentally aggregate across services in different namespaces, so we’re currently including the namespace in the job name:

relabel_configs:
- source_labels: [__meta_kubernetes_service_namespace, __meta_kubernetes_service_name]
  regex: (.+);(.+)
  action: replace
  replacement: $1/$2
  target_label: job

The final piece of the puzzle is to set up Kubernetes and Prometheus for the sidecar / exporter use case. Again this is achieved using service annotations and relabelling rules. The following service definition:

apiVersion: v1
kind: Service
metadata:
  name: memcached
  annotations:
    prometheus.io.port: "9150"
spec:
  ports:
  - name: memcached
    port: 11211
  - name: prom
    port: 9150
  selector:
    name: memcached

…combined with the following relabelling rule:

- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
  action: replace
  target_label: __address__
  regex: ^(.+)(?::\d+);(\d+)$
  replacement: $1:$2

…tells Prometheus the scrape target should be the sidecar, not the container itself.
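
If the replace rules above look opaque, the following Go sketch approximates what a replace action does: join the source label values with ";", match the anchored regex against the result, and expand the replacement. This is a simplified model, not Prometheus’ implementation; relabelReplace and the 10.0.0.5 address are made up for illustration:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// relabelReplace (a hypothetical helper) models a "replace" relabelling
// action. It returns the new label value and true when the regex matches;
// when it does not match, Prometheus leaves the target label untouched.
func relabelReplace(sourceValues []string, pattern, replacement string) (string, bool) {
	// Relabelling regexes are anchored at both ends.
	re, err := regexp.Compile("^(?:" + pattern + ")$")
	if err != nil {
		return "", false
	}
	// Source label values are joined with the default separator ";".
	joined := strings.Join(sourceValues, ";")
	match := re.FindStringSubmatchIndex(joined)
	if match == nil {
		return "", false
	}
	return string(re.ExpandString(nil, replacement, joined, match)), true
}

func main() {
	// Namespace and service name become the job label.
	job, _ := relabelReplace([]string{"default", "frontend"}, `(.+);(.+)`, "$1/$2")
	fmt.Println(job)

	// The port annotation replaces the port in __address__.
	addr, _ := relabelReplace([]string{"10.0.0.5:11211", "9150"}, `^(.+)(?::\d+);(\d+)$`, "$1:$2")
	fmt.Println(addr)
}
```

Note that the address rule only fires when the original address already carries a port and the annotation is present; otherwise the regex does not match and the address is left alone.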

The RED Method

In a series of talks given this summer (slides, video) I proposed, somewhat tongue-in-cheek, the ‘RED’ method for microservices monitoring, to complement the USE Method as described by Brendan Gregg. The RED method states that you should treat services as the ‘primary keys’ for your monitoring, not hosts, processes or replicas. You should monitor a consistent set of metrics for every service: Request rate, Error rate and request Duration, hence the RED backronym. With this information you can design alerts for symptoms, not causes, that align well with business-level metrics and help prevent alert fatigue.

Here is a small example of how to do RED style monitoring with Go and Prometheus:

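import (
    "fmt"
    "html"
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// rpcDuration records request latency, labelled by HTTP method.
var rpcDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "rpc_duration_seconds",
    Help:    "RPC latency distributions.",
    Buckets: prometheus.DefBuckets,
}, []string{"method"})

func init() {
    prometheus.MustRegister(rpcDuration)
}

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        begin := time.Now()
        fmt.Fprintf(w, "Hello, %q", html.EscapeString(r.URL.Path))
        rpcDuration.WithLabelValues(r.Method).Observe(time.Since(begin).Seconds())
    })
    // Expose metrics for scraping; the :8080 port is an arbitrary choice.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}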

Most intra-service RPCs in Weave Cloud are plain old HTTP, and as such we’ve extracted the logic for monitoring HTTP handlers into a little piece of middleware. Also, as a lot of the features in Weave Cloud are built on websockets, we found it useful to differentiate websocket requests from non-websocket requests. This middleware can be found in the Weave Scope repo, and can be used like this:

import (
    "fmt"
    "html"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/weaveworks/scope/common/middleware"
)

var rpcDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "rpc_duration_seconds",
    Help:    "RPC latency distributions.",
    Buckets: prometheus.DefBuckets,
}, []string{"method", "route", "status_code", "ws"})

func init() {
    prometheus.MustRegister(rpcDuration)
}

func main() {
    // The middleware times each request and records it in rpcDuration.
    instr := middleware.Instrument{
        Duration: rpcDuration,
    }
    http.HandleFunc("/", instr.Wrap(func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Hello, %q", html.EscapeString(r.URL.Path))
    }))
}
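
For readers curious what such middleware does internally, here is a rough, standard-library-only sketch. It is not the Weave Scope implementation: observeFunc stands in for a call to a HistogramVec, and instrument, statusRecorder and demo are hypothetical names. It wraps a handler, captures the status code, detects websocket upgrades and reports the four labels used above along with the duration:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
	"time"
)

// observeFunc is a stand-in for HistogramVec.WithLabelValues(...).Observe(...).
type observeFunc func(method, route, statusCode, ws string, seconds float64)

// statusRecorder remembers the status code written by the wrapped handler.
// Caveat: wrapping the ResponseWriter like this hides http.Hijacker, which
// real websocket upgrades need; a production middleware has to pass it through.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// isWebsocket reports whether the request asks for a websocket upgrade.
func isWebsocket(r *http.Request) bool {
	return strings.EqualFold(r.Header.Get("Upgrade"), "websocket")
}

// instrument times next and reports one observation per request.
func instrument(route string, observe observeFunc, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		begin := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		observe(r.Method, route, strconv.Itoa(rec.status),
			strconv.FormatBool(isWebsocket(r)), time.Since(begin).Seconds())
	})
}

// demo sends one failing request through the middleware and returns the
// label values it recorded, joined with commas.
func demo() string {
	var labels string
	observe := func(method, route, statusCode, ws string, _ float64) {
		labels = strings.Join([]string{method, route, statusCode, ws}, ",")
	}
	failing := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "boom", http.StatusInternalServerError)
	})
	h := instrument("root", observe, failing)
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/", nil))
	return labels
}

func main() {
	fmt.Println(demo())
}
```

Because every handler is wrapped the same way, every service exports the same metric with the same labels, which is what makes the shared alerting rules and dashboards possible.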

With this kind of monitoring, it becomes relatively straightforward to build alerting rules. At Weaveworks we alert on latency and error rate. The error rate, expressed as a fraction of all requests, can be calculated like this:

(sum(rate(rpc_duration_seconds_count{job="default/frontend",status_code=~"5.."}[1m]))) /
(sum(rate(rpc_duration_seconds_count{job="default/frontend"}[1m])))

We found it useful to define a recording rule for this, broken down by service. Recording rules should follow the naming best practices, which may seem onerous at first but will come in useful as you build increasingly complex monitoring. With this recording rule defined, our alert rule looks relatively simple:

job:request_errors:rate1m =
    (sum(rate(rpc_duration_seconds_count{status_code=~"5.."}[1m])) by (job)) /
    (sum(rate(rpc_duration_seconds_count[1m])) by (job))

ALERT FrontendErrorRate
  IF job:request_errors:rate1m{job="default/frontend"} > 0.1
  FOR 5m
  LABELS { severity="critical" }
  ANNOTATIONS {
    summary = "frontend service: high error rate",
    description = "The frontend service has an error rate (response code >= 500) of {{$value}}, as a fraction of all requests.",
  }
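
To spell out what the recording rule computes, here is a small Go sketch of the per-job error ratio. It is a simplification: each series carries the number of requests counted during the window rather than a per-second rate (the window length cancels out in the division), and the series type, errorRatioByJob and the default/users job are names invented for this example:

```go
package main

import (
	"fmt"
	"regexp"
)

// series is one labelled counter, reduced to its increase over the window.
type series struct {
	job        string
	statusCode string
	increase   float64 // requests counted during the window
}

// serverError mirrors the status_code=~"5.." matcher; PromQL regex matchers
// are anchored at both ends.
var serverError = regexp.MustCompile(`^5..$`)

// errorRatioByJob returns errors / total for each job, like the recording
// rule's "by (job)" aggregation. Unlike PromQL, a job with no 5xx series is
// reported as 0 rather than omitted, and a job with no traffic is left out.
func errorRatioByJob(window []series) map[string]float64 {
	errs := map[string]float64{}
	total := map[string]float64{}
	for _, s := range window {
		total[s.job] += s.increase
		if serverError.MatchString(s.statusCode) {
			errs[s.job] += s.increase
		}
	}
	ratios := map[string]float64{}
	for job, t := range total {
		if t > 0 {
			ratios[job] = errs[job] / t
		}
	}
	return ratios
}

func main() {
	ratios := errorRatioByJob([]series{
		{"default/frontend", "200", 90},
		{"default/frontend", "500", 8},
		{"default/frontend", "503", 2},
		{"default/users", "200", 40},
	})
	for job, r := range ratios {
		fmt.Printf("%s: %.2f (alert threshold 0.1)\n", job, r)
	}
}
```

A value of 0.1 here means one request in ten failed, which is the threshold the alert above compares against.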

Similarly, we define recording and alerting rules for latency. We also use the same recording rules in our dashboards, which keeps them consistent with our alerts and has the added benefit of making the dashboards much quicker to evaluate.

job:request_duration_seconds:99quantile =
    histogram_quantile(0.99, sum(rate(rpc_duration_seconds_bucket{ws="false"}[5m])) by (le,job))

ALERT FrontendLatency
  IF job:request_duration_seconds:99quantile{job="default/frontend"} > 5.0
  FOR 5m
  LABELS { severity="warning" }
  ANNOTATIONS {
    summary = "frontend service: high latency",
    description = "The frontend service has a 99th-percentile latency of {{$value}} seconds.",
  }
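
It helps to know roughly how a quantile is estimated from histogram buckets, because the result is only as precise as the bucket boundaries. The Go sketch below follows the general approach of histogram_quantile (find the bucket containing the target rank, then interpolate linearly within it) but is a simplification written for this post; the bucket type and the example numbers are assumptions:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket is one cumulative histogram bucket: count observations were
// less than or equal to upperBound (the "le" label).
type bucket struct {
	upperBound float64
	count      float64
}

// histogramQuantile estimates the q-quantile from cumulative buckets.
// The highest bucket must have an upper bound of +Inf.
func histogramQuantile(q float64, buckets []bucket) float64 {
	if q < 0 || q > 1 || len(buckets) < 2 {
		return math.NaN()
	}
	// Work on a sorted copy so the caller's slice is not reordered.
	sorted := append([]bucket(nil), buckets...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].upperBound < sorted[j].upperBound })
	last := len(sorted) - 1
	if !math.IsInf(sorted[last].upperBound, 1) || sorted[last].count == 0 {
		return math.NaN()
	}
	rank := q * sorted[last].count
	// Find the first bucket whose cumulative count reaches the rank.
	b := sort.Search(last, func(i int) bool { return sorted[i].count >= rank })
	if b == last {
		// The rank falls in the +Inf bucket: the best available answer
		// is the upper bound of the highest finite bucket.
		return sorted[last-1].upperBound
	}
	if b == 0 && sorted[0].upperBound <= 0 {
		return sorted[0].upperBound
	}
	start, end, count := 0.0, sorted[b].upperBound, sorted[b].count
	if b > 0 {
		start = sorted[b-1].upperBound
		count -= sorted[b-1].count
		rank -= sorted[b-1].count
	}
	// Assume observations are spread evenly within the bucket.
	return start + (end-start)*(rank/count)
}

// exampleBuckets returns made-up data: 100 requests in four buckets.
func exampleBuckets() []bucket {
	return []bucket{{0.1, 10}, {0.5, 60}, {1, 90}, {math.Inf(1), 100}}
}

func main() {
	fmt.Printf("p50 = %.2fs\n", histogramQuantile(0.5, exampleBuckets()))
	fmt.Printf("p99 = %.2fs\n", histogramQuantile(0.99, exampleBuckets()))
}
```

In this example the 99th percentile falls in the +Inf bucket, so the estimate is capped at the largest finite bound. For an alert threshold of 5 seconds to be meaningful, the histogram needs bucket boundaries at or above that value.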

Designing Effective Dashboards

Once you’ve started collecting and alerting on all these useful metrics, you will want to draw some pretty charts. As mentioned in previous blog posts, and in line with the majority of the Prometheus community, we use Grafana. Our dashboards are not particularly sophisticated (yet!), but some basic principles have allowed us to build consistent dashboards that are relatively easy to understand.

Firstly, all of our dashboards are two columns – the left-hand column being request rates broken down by result (success, error), and the right-hand column being various latency quantiles – 99th and 50th percentile typically. We make sure the series for successful requests is green and the series for unsuccessful requests is red, and enforce all of this with a simple Python lint script for our Grafana dashboards. This allows you to easily recognise exceptional situations at a glance, although we should choose different colors – red and green are not very friendly toward color-blind individuals.

Secondly, we organise our dashboards as a row-per-service, and order the rows in a breadth-first traversal of dependent services (imagining for a moment that they form a nice tree). As mentioned above, we use Kubernetes namespaces to group related microservices into larger “macro” services, and we’ve tended towards having a different dashboard per namespace. This also broadly aligns with our team structure in Weaveworks, as predicted by Conway’s law.
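
As an illustration of the row ordering, here is a short Go sketch of a breadth-first traversal over a service dependency map. The rowOrder function and the service names are invented for the example, and real dependency graphs are rarely clean trees, so services already seen are skipped:

```go
package main

import "fmt"

// rowOrder returns services in breadth-first order starting from root.
// deps maps each service to the services it calls.
func rowOrder(root string, deps map[string][]string) []string {
	seen := map[string]bool{root: true}
	order := []string{}
	queue := []string{root}
	for len(queue) > 0 {
		svc := queue[0]
		queue = queue[1:]
		order = append(order, svc)
		for _, d := range deps[svc] {
			// Skip services reached earlier via another path.
			if !seen[d] {
				seen[d] = true
				queue = append(queue, d)
			}
		}
	}
	return order
}

func main() {
	deps := map[string][]string{
		"frontend": {"users", "app"},
		"users":    {"db"},
		"app":      {"db", "cache"},
	}
	for i, svc := range rowOrder("frontend", deps) {
		fmt.Printf("row %d: %s\n", i+1, svc)
	}
}
```

With this ordering, the services closest to the user appear at the top of the dashboard and their dependencies follow, so a problem can be traced downwards row by row.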

Using Prometheus to monitor your application on Kubernetes

In summary, at Weaveworks we’ve discovered a series of practices for using Prometheus to monitor our Kubernetes services:

  • We model Kubernetes pods and services as Prometheus instances and jobs using a combination of specific service discovery configuration and relabelling rules.
  • We made Prometheus monitoring the default for our cluster, and added various relabelling rules and service annotations to support exporters and monitoring opt-out.
  • We made the Prometheus job name contain the Kubernetes service namespace, to prevent accidental aggregation across jobs in different namespaces.
  • We use the RED method to consistently instrument our services and build dashboards and alerts.
