This is a tale with many twists and turns, a tale of observation, analysis and optimisation, of elation and disappointment. It starts with disk space.
Wind back four years to get the background: Weaveworks created the Cortex project, which the CNCF have recently graduated to "incubating" status. Cortex is a time-series database system based on Prometheus. We run Cortex in production as part of Weave Cloud, ingesting billions of metrics from clusters all over the world and serving insight and analysis to their devops owners.
I spend one week out of four on SRE duties for Weave Cloud, responding to alerts and looking for ways to make the system run better. Lessons learned from this then feed into our Weave Kubernetes commercial product.
Coming on shift September 10th, I noticed that disk consumption by Cortex was higher than I remembered. We expect the product to grow over time, and thus to use more resources, but looking at the data there had been a marked jump a couple of weeks earlier, and consumption by all customers had jumped at the same time. It had to be caused by something at our end.
Bit more background: Cortex doesn’t write every sample to the store as it comes in; instead it compresses hours of data into “chunks” which are much more efficient to save and retrieve. But machines sometimes crash, and if we lost a server we wouldn’t want that to impact our users, so we replicate the data to three different servers. Distribution and segmentation of time-series are very carefully arranged so that, when it comes time to flush each chunk to the store, the three copies are identical and we only store the data once.
Reason I’m telling you this is, by looking at statistics about the store, I could see this was where the increased disk consumption was coming from: the copies very often did not match, so data was stored more than once. This chart shows the percentage of chunks detected as identical: on the left is from a month earlier, and on the right is the day when I started to look into the problem.
OK, what causes Cortex chunks to be non-identical? Over to Jaeger to see inside a single ‘push’ operation:
The Cortex distributor replicates incoming data, sending it to ingesters which compress and eventually store the chunks. Somehow, calls to ingesters were not being served within the two second deadline that the distributor imposes.
Well that was a surprise, because we pay a lot of attention to latency, particularly the “p99 latency” that tells you the one-in-a-hundred situation. P99 is a good proxy for what customers may occasionally experience, and particularly notable if it’s trending worse. Here’s the chart for September 10th - not bad, eh?
But, salutary lesson: Histograms Can Hide Stuff. Let’s see what the 99.9th centile looks like:
So one in a thousand operations take over ten times as long as the p99 case! By the way, this is the “tail latency” in the title of this blog post: as we look further and further out into the tail of the distribution, we can find nasty surprises.
That’s latency reported on the serving side; from the calling side it’s clearer we have a problem, but unfortunately the histogram buckets here only go up to 1 second:
Here’s a chart showing the rate of deadline-exceeded events that day: for each one of these the data samples don’t reach one of the replicas, leading to the chunks-not-identical issue:
It’s a very small fraction of the overall throughput, but enough to drive up our disk consumption by 50%.
OK, what was causing these slow response times? I love a good mystery, so I threw myself into finding the answer. I looked at:
- Overloading. I added extra CPUs and RAM to our cloud deployment, but still the occasional delays continued.
- Locking. Go has a mutex profile, and after staring at it for long enough I figured it just wasn’t showing me any hundred-millisecond delays that would account for the behaviour.
- Blocking. Go has this kind of profile too, which shows when one part of the program is hanging around waiting for something like IO, but it turns out this describes most of Cortex. Nothing learned here.
I looked for long-running operations which could be chewing up resources inside the ingester; one in particular from our Weave Cloud dashboard service was easily cached, so I did that, but still no great improvement.
One of my rules of thumb when trying to improve software performance is “It’s always memory”. (Perhaps cribbed from Richard Stiles’ “It's the Memory, Stupid!”, but he was talking about microprocessor design). Anyway, looking at heap profiles threw up one candidate: the buffers used to stream data for queries could be re-used. I implemented that and the results looked good in the staging area, so I rolled it out to production. Here’s what I saw in the dashboard; rollout started at 10:36GMT:
I was ecstatic. Problem solved!
Let’s just open out that timescale a little. A couple of hours after the symptom went away, it was back again! Maybe only half as bad, but I wanted it fixed, not half-fixed.
OK, what do we do when we can’t solve a performance problem? We stare at the screen for hours and hours until inspiration strikes.
It had been great for a couple of hours. What changed? Maybe some customer behaviour - maybe someone started looking at a particular page around 12:30?
Suddenly it hit me. The times when performance was good lined up with the times that DynamoDB was throttling Cortex.
What the? That can’t possibly be right.
About throttling: AWS charges for DynamoDB both by storage and by IO operations per second, and it’s most cost-effective if you can match the IO provision to demand. If you try to go faster than what you’re paying for, DynamoDB will throttle your requests, but because Cortex is already holding a lot of data in memory we don’t mind going slowly for a bit. The peaks and troughs even out and we get everything stored over time.
So that last chart above shows the peaks, when DynamoDB was throttling, and the troughs, when it wasn’t, and those different regions match up exactly to periods of high latency and low latency.
Still doesn’t make sense. The DB storage side of Cortex runs completely asynchronously to the input side, which is where the latency was.
Well, no matter how impossible it seemed, there had to be some connection. What happens inside Cortex when DynamoDB throttles a write? Cortex waits for a bit then retries the operation. And it hit me: when there is no throttling, there is no waiting. Cortex will fire chunks into DynamoDB as fast as it will take them, and that can be pretty darn fast. Cortex triggers those writes from a timer - we cut chunks at maximum 8 hours - and that timer runs once a minute.
In the non-throttled case there would be a burst of intense activity at the start of every minute, followed by a long period where things were relatively quiet. If we zoom right in to a single ingester we can see this in the metrics, going into a throttled period around 10:48:
Proposed solution: add some delays to spread out the work when DynamoDB isn’t throttling. We already use a rate-limiter from Google elsewhere in Cortex, so all I had to do was compute a rate which would allow all queued chunks to be written in exactly a minute. The code for that still needs a little tweaking as I write this post.
That new rate-limiting code rolled out September 16th, and I was very pleased to see that the latency went down and this time it stayed down:
And the rate at which chunks are found identical, which brings down disk consumption, doesn’t recover until 8 hours after a rollout, but it’s now pretty much nailed at 66% where it should be: