The topic of how the services in Weave Cloud are instrumented came up in two discussions this week, so I thought I’d write a short blog post explaining all the moving parts.

Instrumenting Requests

For all services in Weave Cloud, we measure request rate, error rate and request duration (the three principles of the RED method). We use Prometheus to gather all the metrics, and we instrument our code to export them.

Most* services use the same library to achieve this: our HTTP and gRPC middleware library at github.com/weaveworks/common/middleware. It appears there are a bunch of meanings of the word middleware, but in this sense we mean self-contained code which independently acts on a request before or after your normal application handlers[1].

You can use this middleware like such:

package main

import (
    "net/http"

    "github.com/gorilla/mux"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/weaveworks/common/middleware"
)

var (
    RequestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "request_duration_seconds",
        Help:    "Time (in seconds) spent serving HTTP requests.",
        Buckets: prometheus.DefBuckets,
    }, []string{"method", "route", "status_code", "ws"})
)

func init() {
    prometheus.MustRegister(RequestDuration)
}

func main() {
    router := mux.NewRouter()
    router.Path("/metrics").Handler(prometheus.Handler())
    router.Path("/").Handler(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("Hello world"))
    }))
    http.ListenAndServe(":8000", middleware.Instrument{
        Duration: RequestDuration,
    }.Wrap(router))
}

You need to expose the /metrics endpoint so Prometheus can gather the metrics. The middleware library also includes handy middleware to add logging, path rewriting, error handling and middleware chaining, but I won’t go into them here.

We use a Prometheus histogram metric as it allows for us to aggregate latencies across multiple instances of the same service – Prometheus summaries do not. Only a single Prometheus histogram metric is required, as a histogram will also collect the number of requests, and we include a dimension for “status_code” such that they can be broken down by error rate. We also include a dimension to differentiate WebSocket requests (ws), so we can exclude them from our latency monitoring – but WebSockets are evil and should die.

Some queries you can now do on that metrics – assuming it is running as a service called hello in the default namespace:

  • sum(irate(request_duration_seconds_count{job="default/hello"}[1m])) by (status_code) – will give you the instantaneous QPS of the service, by status code, aggregated across all replicas.
  • sum(irate(request_duration_seconds_count{job="default/hello", status_code=~"5.."}[1m])) – will give you the rate of requests returning 500s.
  • sum(irate(request_duration_seconds_count{job="default/hello"}[1m])) by (instance) – will give you the QPS per instance, useful for checking if load balancing is working.
  • sum(rate(request_duration_seconds_sum{job="default/hello",ws="false"}[5m])) / sum(rate(request_duration_seconds_count{job="default/hello",ws="false"}[5m])) – will give you a 5-minute moving average latency of requests to the service.
  • histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket{job="default/hello",ws="false"}[5m])) by (le)) – will give you the 5-min moving 99th percentile request latency.

With these queries you can now build our standard Weave Cloud dashboards, featuring QPS by result on the left and latency on the right. We include the average latency not because you should ever take any notice of average latencies, but because it helps sanity check that the latency histogram is working correctly – it is relatively easy to get the histogram buckets wrong:

For services in Weave Cloud, we also tend to specify a namespace in the metric options that broadly aligns with the repo name and/or namespace the jobs run in. Prometheus will prefix these namespaces onto the metric name. Jobs in the default namespace come from service.git, so metrics tend to be prefixed with service_, but for Cortex and Scope the prefix is more predictable.

What about instrumenting things that aren’t requests?

Ah, good question! I’m going to assume you mean things like RPCs to services which aren’t instrumented – such as Postgres, or DynamoDB, or S3. Luckily, we also have a solution for this, although it’s a little more verbose. We have a wrapper in the github.com/weaveworks/common/instrument library which allows you to gather timings for pretty much any code block – here is an abbreviated example from Cortex, where we instrument queries to DynamoDB:

var (
    dynamoRequestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Namespace: "cortex",
        Name:      "dynamo_request_duration_seconds",
        Help:      "Time spent doing DynamoDB requests.",
        // DynamoDB latency seems to range from a few ms to a few sec and is
        // important.  So use 8 buckets from 64us to 8s.
        Buckets: prometheus.ExponentialBuckets(0.000128, 4, 8),
    }, []string{"operation", "status_code"})
    ...
)

func init() {
    prometheus.MustRegister(dynamoRequestDuration)
    ...
}

func (d dynamoClientAdapter) BatchWrite(ctx context.Context, input WriteBatch) error {
    ...
    var resp *dynamodb.BatchWriteItemOutput
    err := instrument.TimeRequestHistogram(ctx, "DynamoDB.BatchWriteItem", dynamoRequestDuration, func(_ context.Context) error {
        var err error
        resp, err = d.DynamoDB.BatchWriteItem(&dynamodb.BatchWriteItemInput{
            RequestItems:           reqs,
            ...
        })
        return err
    })
    ...
}

The TimeRequestHistogram function takes and executes a callback for you, timing it and emitting a sample to the histogram. The histogram has dimensions for the operation (the string you provide to TimeRequestHistogram) and the error, which is munged from the error returned by the callback – in advanced usage you can provide an extra callback to override this munging.

As an added benefit, this function will also emit an OpenTracing span so these requests show up in any OpenTracing-compatible tracer – in the trace below, the FindUsersByID, ListOrganizationsForUserIDs and DynamoDB.QueryPages calls all come from function:


Thank you for reading our blog. We build Weave Cloud, which is a hosted add-on to your clusters. It helps you iterate faster on microservices with continuous delivery, visualization & debugging, and Prometheus monitoring to improve observability.

Try it out, join our online user group for free talks & trainings, and come and hang out with us on Slack.