How I Halved the Storage of Cortex

September 12, 2019

Find out how Bryan Boreham managed to cut the storage of Cortex’s time-series data in half by re-architecting how the data gets split into chunks.

This is the story of a great improvement in version 0.2 of Cortex, released on September 5th 2019.

Cortex is a horizontally scalable system built on Prometheus, using a back-end such as Cassandra, DynamoDB or S3 to store time-series data. Weaveworks created Cortex to use as part of Weave Cloud, and it is now a CNCF project.

How Cortex manages time-series data

The ‘ingester’ component of Cortex acts as a write de-amplifier, fielding thousands of input requests for each output to the store. Time-series are built up in memory and compressed; when they reach a maximum age, such as 6 hours, they are sent to the store in ‘chunks’. Each chunk has an entry in the index that allows fast lookup by time and by the labels that identify each metric.
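
To make this concrete, here’s a minimal sketch (in Go) of the two pieces involved - the in-memory chunk and the index entry that points at it. The type and field names are hypothetical, not Cortex’s actual ones:

```go
package sketch

import "time"

// ChunkDesc is a rough sketch of the record an ingester builds up in
// memory: compressed samples for one series, bounded by a time range.
type ChunkDesc struct {
	Labels map[string]string // identifies the series, e.g. {job="api", instance="10.0.0.1"}
	Start  time.Time         // timestamp of the first sample in the chunk
	End    time.Time         // timestamp of the last sample in the chunk
	Data   []byte            // compressed samples
}

// IndexEntry sketches what makes chunks findable: one row per label
// pair, covering the chunk's time range, pointing at the chunk's key.
type IndexEntry struct {
	LabelName  string
	LabelValue string
	From, To   time.Time
	ChunkKey   string // where the chunk lives in the object store
}
```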

The data sits in memory for hours while being compressed, so to keep working if one or two ingesters fail, we replicate each series to three different ingesters. This scheme works well, and Cortex stores trillions of data points in production for Weave Cloud.
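
Cortex picks those three ingesters using a consistent hash ring. As a rough sketch of the idea - one token per ingester and hypothetical names, where the real ring is more involved:

```go
package sketch

import "sort"

// Ingester is a hypothetical stand-in: one token per ingester here,
// where a real ring assigns many tokens to each.
type Ingester struct {
	Addr  string
	Token uint32
}

// pickReplicas sketches how a series fingerprint maps to three
// ingesters: find the first token at or after the fingerprint on the
// ring (sorted by Token, at least 3 entries), then walk clockwise.
func pickReplicas(fingerprint uint32, ring []Ingester) []Ingester {
	i := sort.Search(len(ring), func(j int) bool { return ring[j].Token >= fingerprint })
	replicas := make([]Ingester, 0, 3)
	for len(replicas) < 3 {
		replicas = append(replicas, ring[i%len(ring)]) // wrap around the ring
		i++
	}
	return replicas
}
```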

[Figure: ingester-store.png]

Data replicas don’t match

However, in practice those three replicas of the data aren’t exactly the same - each chunk has slightly different start and end times, because the ingesters started at different times - so we end up with three copies of the data in the store, and de-duplicate them when the data is read back.

[Figure: time-series-1.png]

Many times I looked for a way to deduplicate the data, but it isn’t easy when you have billions of chunks to deal with. Having each ingester stop and check what has already been written before each write would hamper scalability - the golden rule is that components should have no interlocks or shared state.

Investigating short time-series

The first part of the solution came when I investigated what happens with very short time-series - for example, a small cron job that fires up, runs for a minute, then shuts down again. Once its metrics reach Cortex, the chunks shouldn’t have a misalignment problem, since they all start when the ingesters are ready, and they’re not big enough to spill over into another chunk. But still, I found that Cortex was storing three copies! It turned out the labels inside the chunk were written out in essentially random order, so the checksum we add to each chunk to ensure data integrity would come out different for each replica. Modifying the code to write labels in alphabetical order got round that problem: Cortex now writes the three copies out identically.
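
The essence of that fix is easy to sketch: canonicalise the label order before anything is serialised or checksummed. This is an illustrative version, not the actual Cortex change (and the choice of CRC32 here is an assumption):

```go
package sketch

import (
	"hash/crc32"
	"sort"
)

// canonicalLabels returns label pairs in a fixed alphabetical order,
// so identical series data always serialises to identical bytes.
func canonicalLabels(labels map[string]string) []string {
	pairs := make([]string, 0, len(labels))
	for name, value := range labels {
		pairs = append(pairs, name+"="+value)
	}
	sort.Strings(pairs) // deterministic order -> deterministic checksum
	return pairs
}

// chunkChecksum shows why the ordering matters: feed the labels into
// the checksum in canonical order and all three replicas agree.
func chunkChecksum(labels map[string]string, data []byte) uint32 {
	h := crc32.NewIEEE()
	for _, pair := range canonicalLabels(labels) {
		h.Write([]byte(pair))
	}
	h.Write(data)
	return h.Sum32()
}
```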

When you write multiple records under the same key to a store like Cassandra, DynamoDB or S3, the store keeps only one of them. This fits neatly with how Cortex works, so overall this took out ⅔ of the storage used by those short time-series. Yay!
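
For the store to collapse the copies, the key has to be derived from the chunk itself, so that replicas which serialise to the same bytes also produce the same key. A hypothetical sketch of such a key:

```go
package sketch

import (
	"fmt"
	"time"
)

// chunkKey sketches a storage key built entirely from the chunk's own
// content and metadata. Identical replicas produce identical keys, so
// writing all three leaves a single object in the store. The exact
// field layout here is illustrative.
func chunkKey(fingerprint uint64, start, end time.Time, checksum uint32) string {
	return fmt.Sprintf("%x:%x:%x:%x", fingerprint, start.Unix(), end.Unix(), checksum)
}
```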

Cortex can also be configured to write each chunk to Memcached after it’s gone to the store - Memcached is cheap and fast compared to cloud storage, so we use it to speed up queries. Weighing the 2-3ms a cache lookup takes against the price per I/O to the database, I made the ingester check whether the key was already in the cache before sending a chunk to the store. Now we avoid the store I/O for the extra copies that would be eliminated inside the store anyway.
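
A sketch of that flush path, with Cache and Store as hypothetical stand-ins for the Memcached client and the chunk store:

```go
package sketch

import "context"

// Cache and Store are hypothetical interfaces standing in for the
// Memcached client and the chunk store.
type Cache interface {
	Exists(ctx context.Context, key string) bool
	Set(ctx context.Context, key string, value []byte) error
}

type Store interface {
	Put(ctx context.Context, key string, value []byte) error
}

// flushChunk sketches the optimisation: a ~2-3ms cache lookup is far
// cheaper than a store write, so skip the write entirely when an
// identical replica has already flushed this chunk.
func flushChunk(ctx context.Context, cache Cache, store Store, key string, data []byte) error {
	if cache.Exists(ctx, key) {
		return nil // another replica already wrote the identical chunk
	}
	if err := store.Put(ctx, key, data); err != nil {
		return err
	}
	return cache.Set(ctx, key, data) // let the other replicas skip their writes
}
```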

With all of this up and running, I could see that the overall benefit was about 8% - the vast majority of Weave Cloud customers’ time-series run on for days and days, rather than being the short ones helped by these changes.

Handling long-running time series

To tackle long-running series, I hit upon the idea of changing the way they are split into chunks. Instead of cutting a chunk 6 hours after the series started - a point in time relative to when the ingester fired up - I made each time-series map to a fixed point in a repeating 6-hour cycle. (PR #1578)
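
In sketch form, the cut decision might look like this (names and details are illustrative, simplified from the actual PR):

```go
package sketch

import "time"

const chunkPeriod = 6 * time.Hour

// shouldCutChunk sketches the idea: every series gets a stable offset
// within a repeating 6-hour cycle, derived from its fingerprint, and a
// chunk is cut whenever a sample crosses that series' next boundary.
// Every replica computes the same boundaries, so chunks line up exactly.
func shouldCutChunk(fingerprint uint64, ts, chunkStart time.Time) bool {
	offset := int64(fingerprint % uint64(chunkPeriod)) // stable per-series shift
	period := int64(chunkPeriod)
	// cut when the new sample falls in a different (shifted) period
	// from the one the chunk started in
	return (ts.UnixNano()+offset)/period != (chunkStart.UnixNano()+offset)/period
}
```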

[Figure: time-series-2.png]

For the first 6 hours after startup, ingesters flush chunks that can be quite short and don’t align on start time, but after that everything comes into sync: all the start and end times line up, and chunk writes drop by about 60%. The title says “halved” because we still need to index all the different series, so overall storage comes down by about 50%.

So that’s it - one of the greatest optimizations I’ve ever done, and it’s quite simple when you see how.

The source code to Cortex is available on GitHub, or you can sign up for a free trial of our managed service at http://cloud.weave.works.

