Cortex is our open source project which powers our hosted Prometheus-as-a-Service in Weave Cloud.

The Cortex write path consists of a ‘distributor’ which sends batches of samples in parallel to the ‘ingesters’. The ingester 99th percentile write latency is pretty stable at about 8ms (the whole operation is in-memory), whereas the distributor 99th percentile write latency was hovering around 80ms (the 50th percentile was much closer to the ingester latency). Clearly something was wrong.

This post is as much a tour of the tools I used to investigate this than anything else. If you’re battling with high long tail latency, this might be worth a read.


This issue was initially surfaced by our monitoring; see previous blog post “Of Metrics And Middleware” for an overview of how Cortex (and almost all Weave Cloud services) are instrumented. The graphs of latency for the two services looked like this:

It’s perhaps worth noting that the layout of the dashboards was important in noticing this issue: we arrange the dashboards in a row per service, with (broadly speaking) the services at any point on the dashboard page speaking to the services below them. By stacking them like this, and having QPS on the left and latency on the right, we could quickly spot when and where in the request flow something is amiss because requests “flow” from the top to the bottom of the dashboard page.


The first step was to indeed check that the distributor was parallelising writes to the ingesters - and it was. My initial hypothesis was that the long tail latency at the distributor was in fact caused by a long tail latency in the ingesters, but due to the QPS amplification (every 1 write to a distributor does 3 writes to the ingesters) this wasn’t showing up in the ingester latency histograms. This lead to a PR which made some interesting use of the atomic package to allow the distributor to return early from writes that had been acknowledged by a quorum of ingesters, and therefore shouldn’t be as sensitive to ingester long tail latency - see weaveworks/cortex#272.

This change had virtually no effect on the distributor latency.

Distributed Tracing

It was at this point we broke out Loki, our OpenTracing/OpenZipkin/Prometheus mashup for distributed tracing. Over the past few weeks I’ve been quietly instrumenting our services for OpenTracing, plumbing contexts through where appropriate. Right now, the Users service, the AuthFE and all the Cortex services are instrumented:

I dutifully went and filed issue weaveworks/cortex#289 with this trace.

The trace pointed to a large gap processing the push request before the distributor sends the request to the ingesters. A hunch told me the way we calculated the ingester to send each sample to, on a sample-by-sample basis, was slow. So I wrote a quick Go benchmark test for this in weaveworks/cortex#290, but the results showed this wasn’t remotely slow:

BenchmarkRing-4          10000        108009 ns/op

Strike two. I hoped that a CPU profile would tell me what was going on, so I snuck the et/http/pprof stanza into the same PR.

CPU Profiling

The easiest way I’ve found to profile Go program running on a Kubernetes cluster is to use kubectl port-forward and then collect the trace locally:

$ kubectl port-forward --namespace=cortex distributor-2457666734-2665o 8000:80 &

$ go tool pprof cmd/distributor/distributor  http://localhost:8000/debug/pprof/profile
Fetching profile from http://localhost:8000/debug/pprof/profile
Please wait... (30s)
Saved profile in /Users/twilkie/pprof/pprof.distributor.localhost:8000.samples.cpu.002.pb.gz
Entering interactive mode (type "help" for commands)

This works nicely as the Cortex binaries in my source tree were built in a Docker container, and were the actual binaries I was running in dev. As they are built in a container, the profiler expected the source code to be in /go/src, so a quick hacky symlink made pprof happy:

ln -s $GOPATH /go

The canonical reference for CPU profiling is the Go blog on the topic. The commands I find the most useful are (NB not the actual results, I’ve lost that):

(pprof) top10 -cum
0.06s of 5.37s total ( 1.12%)
Dropped 232 nodes (cum <= 0.03s)
Showing top 10 nodes out of 264 (cum >= 1.55s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%      4.76s 88.64%  runtime.goexit
         0     0%     0%      2.29s 42.64%
         0     0%     0%      2.29s 42.64%
         0     0%     0%      1.71s 31.84%*Distributor).Push.func2
     0.02s  0.37%  0.37%      1.71s 31.84%*Distributor).sendSamples
     0.02s  0.37%  0.74%      1.68s 31.28%*Distributor).sendSamplesErr
     0.02s  0.37%  1.12%      1.62s 30.17%  runtime.systemstack
         0     0%  1.12%      1.55s 28.86%*ingesterClient).Push
         0     0%  1.12%      1.55s 28.86%*Distributor).sendSamplesErr.func1
         0     0%  1.12%      1.55s 28.86%*ingesterClient).Push

And rendering the trace as a PDF (this is the actual render):

(pprof) pdf >cpu.pdf

The trace shows about 30% of the CPU is being used by marshalling, and fair amount of GC too. But ProtoBuffers are supposed to be quick, right?

Distributed Tracing II

Slightly baffled by this, I decided to wrap any operation I thought might be even remotely expensive with instument.TimeRequestHistogram to see what’s was happening in this gap:

This trace clearly indicated that the most time was being lost deserialising the ProtoBuffer-formatted request. It was at this point I went and investigated the generated code behind our ProtoBuffers, only to discover that only type definitions are generated, and that the marshaling code still relies on reflect. A quick googling later, I discovered a project called gogoprotobuf which will generate the marshalling code for you.

Gogoprotobuf has a series of protobuf compilers which progressively add various optimisations - the base one is pretty much the same as the Google protobuf compiler, whilst the most agressive one generates marshalling functions, equality functions, comparison functions, stringers etc. I immediately turned on the most aggressive compiler.

Still no reduction in the long tail.

At this point I started to notice the traces I was getting back from Loki were quite inconsistent - some showed no time lost in the ProtoBuffer code:

Combined with the CPU profile above, I started to think this could be GC-related. Gogoproto has a series of “extensions” which allow you to remove lots of pointers from the generated code, hopefully making it kinder on the garbage collector. I enabled all of these (which required some code changes on my end). Cortex also has an “internal” format, based on the original Prometheus datatypes, which we tend to convert the on-the-wire type to and from. Figuring this wasn’t healthy for the garbage collector either, I changed the distributor code paths to use the ProtoBuffers directly.

This resulted in a modest reduction in long tail, but it was difficult to discern from the noise. But I finally felt like I was getting somewhere…

Memory Profiling

The final tool I needed was the Go memory profiler. Its use is virtually identical to the CPU profiling above:

457666734-2665o 8000:80 &

$ go tool pprof cmd/distributor/distributor  http://localhost:8000/debug/pprof/heap
(pprof) top10
6909.53kB of 7933.63kB total (87.09%)
Dropped 320 nodes (cum <= 39.67kB)
Showing top 10 nodes out of 82 (cum >= 512.20kB)
      flat  flat%   sum%        cum   cum%
 1536.11kB 19.36% 19.36%  1536.11kB 19.36%*LabelPair).Unmarshal
 1184.27kB 14.93% 34.29%  1184.27kB 14.93%
  553.04kB  6.97% 41.26%  2089.15kB 26.33%
  528.17kB  6.66% 47.92%   528.17kB  6.66%
  528.17kB  6.66% 54.57%   528.17kB  6.66%  regexp.(*bitState).reset
  518.91kB  6.54% 61.12%   518.91kB  6.54%
  518.02kB  6.53% 67.64%   518.02kB  6.53%
  516.01kB  6.50% 74.15%   516.01kB  6.50%  runtime.allgadd
  514.63kB  6.49% 80.64%   514.63kB  6.49%*Distributor).sendSamplesErr
  512.20kB  6.46% 87.09%   512.20kB  6.46%  runtime.malg

(pprof) list \(\*LabelPair\).Unmarshal
Total: 7.75MB
ROUTINE ========================*LabelPair).Unmarshal in /go/src/
    1.50MB     1.50MB (flat, cum) 19.36% of Total
         .          .   3265:            }
         .          .   3266:            postIndex := iNdEx + intStringLen
         .          .   3267:            if postIndex > l {
         .          .   3268:                return io.ErrUnexpectedEOF
         .          .   3269:            }
  512.01kB   512.01kB   3270:            m.Name = string(dAtA[iNdEx:postIndex])
         .          .   3271:            iNdEx = postIndex
         .          .   3272:        case 2:
         .          .   3273:            if wireType != 2 {
         .          .   3274:                return fmt.Errorf("proto: wrong wireType = %d for field Value", wireType)
         .          .   3275:            }
         .          .   3276:            var stringLen uint64
         .          .   3277:            for shift := uint(0); ; shift += 7 {
         .          .   3278:                if shift >= 64 {
         .          .   3279:                    return ErrIntOverflowCortex
         .          .   3280:                }
         .          .   3281:                if iNdEx >= l {
         .          .   3282:                    return io.ErrUnexpectedEOF
         .          .   3283:                }
         .          .   3284:                b := dAtA[iNdEx]
         .          .   3285:                iNdEx++
         .          .   3286:                stringLen |= (uint64(b) & 0x7F) << shift
         .          .   3287:                if b < 0x80 {
         .          .   3288:                    break
         .          .   3289:                }
         .          .   3290:            }
         .          .   3291:            intStringLen := int(stringLen)
         .          .   3292:            if intStringLen < 0 {
         .          .   3293:                return ErrInvalidLengthCortex
         .          .   3294:            }
         .          .   3295:            postIndex := iNdEx + intStringLen
         .          .   3296:            if postIndex > l {
         .          .   3297:                return io.ErrUnexpectedEOF
         .          .   3298:            }
       1MB        1MB   3299:            m.Value = string(dAtA[iNdEx:postIndex])
         .          .   3300:            iNdEx = postIndex
         .          .   3301:        default:
         .          .   3302:            iNdEx = preIndex
         .          .   3303:            skippy, err := skipCortex(dAtA[iNdEx:])
         .          .   3304:            if err != nil {

The profile made it pretty clear that the unmarshalling in (*LabelPair).Unmarshal was the main source of memory usage. In Go converting a []byte to a string involves a copy, as strings are immutable but []bytes are not. The Prometheus remote write format sends batches of 100 samples, along with all their labels. On average, each sample has 7 label pairs - so each RPC has around 1400 small strings. We were doing ~120 QPS in the dev environment - so 168,000 small string copies a second, or about 8MB/s guessing each string is 50 bytes. Surely Go’s GC could deal with that?

Zero Copy

There are various tricks you can do to remove this copy, if you know the underlying buffer won’t be modified whilst using the string (which we did). Gogoprotobuf has an “unsafe” mode which I was expecting would avoid the copy, but it doesn’t. On the wire, ProtoBuffer strings are exactly the same as byte arrays - but unfortunately Gogoprotobuf does a copy for these also (they are erring on the side of safety):

m.Name = append(m.Name[:0], dAtA[iNdEx:postIndex]…)

Gogoprotobuf does allow for custom types though - so we can write a custom type which avoids the copy:

// Bytes exists to stop proto copying the byte array
type Bytes []byte

// Unmarshal updates Bytes to be data, without a copy
func (bs *Bytes) Unmarshal(data []byte) error {
    *bs = data
    return nil

Using this in the proto file is relatively straight forward:

message LabelPair {
  bytes name  = 1 [(gogoproto.customtype) = "", (gogoproto.nullable) = false];
  bytes value = 2 [(gogoproto.customtype) = "", (gogoproto.nullable) = false];

Did this work? I deployed and nervously mashed F5 watching the latency drop:

The 99th percentile latency eventually settled at 25ms. I couldn’t believe it. 4x latency reduction by not copying strings. Just to verify, I quickly rolled back to the master build to see how bad it was:

Go 1.8

The final twist in this story - Go 1.8 was released with claimed improvements to the garbage collector. We had to try it out:

It made it bloody slower.

Thank you for reading our blog. We build Weave Cloud, which is a hosted add-on to your clusters. It helps you iterate faster on microservices with continuous delivery, visualization & debugging, and Prometheus monitoring to improve observability.

Try it out, join our online user group for free talks & trainings, and come and hang out with us on Slack.