In this guest post Neeraj Poddar, Lead Platform Architect at Aspen Mesh, takes us through their journey to running Cortex in production.
At Aspen Mesh, we have been running and managing Cortex in production for the last 7 months. It has become an important part of our offering and enables us to scale economically with our customers. If you’re not familiar with Cortex, here’s a quick overview.
Cortex is an open source project (now a CNCF sandbox project), initially started by the Weaveworks team to solve some of the challenges around running Prometheus at scale. As the project’s GitHub page states, Cortex is a “horizontally scalable, multi-tenant” Prometheus as a service. If you’re wondering why you might want to use Cortex, this blog might help answer that question.
In this blog I’ll cover four key phases of the Cortex journey at Aspen Mesh, in the hope it can help others learn from our experiences.
- What were the challenges we were facing before using Cortex?
- Why did we choose Cortex?
- What has been our experience running Cortex in production at scale?
- What are we looking forward to with regard to the Cortex project and community?
Challenges Before Cortex
Before I go into the problems we were trying to solve with Cortex, let me give an overview of the problem we are trying to solve with Aspen Mesh. What we are building is an enterprise service mesh built on Istio which provides a SaaS platform that hosts Prometheus (and Jaeger and Grafana) for our customers. This means we need an architecture that can support data (metrics, telemetry, config, and so on) from thousands of customers’ microservices coming into our platform while maintaining the correct level of isolation and performance for each tenant. Additionally, we had the constraint of preserving Prometheus’ APIs since other parts of the system depended on them.
Need for multi-tenanted Prometheus
Our first crack at solving this problem was to deploy unique Prometheus instances for each tenant into their own isolated namespaces. Prometheus was built to be a highly optimized time-series database capable of storing data on attached persistent volumes. It is not designed to be consumed in a multi-tenant way as there is no concrete mechanism for isolation.
Given all these constraints and our requirements for hosting and managing Prometheus for our customers, it meant we had a large number of stateful Prometheus instances running in our cluster that would collect metric data for each tenant. This doesn’t seem that bad, right?
Secure gRPC remote scraping of tenants
Well, as you may know, Prometheus uses a pull-based protocol, designed to auto-discover endpoints and scrape data at a regular, configured interval. Since Prometheus was running in our cluster, it had to scrape data from our tenants’ clusters. We designed a remote scrape tunneling protocol built on gRPC that securely scrapes metric data from our tenants.
When we started doing this, there weren’t many solutions to this problem. Given our deadlines and the in-depth networking expertise on our team, we decided to go down this route with our eyes wide open.
Challenges with Secure Remote Scraping
This rather hacky solution did work. We were able to securely get metric data from customer clusters. But as you can imagine, there were many challenges with this approach. In the end, we knew that a better strategy going forward was needed. Some of the major challenges were:
- Managing upgrades, persistent volumes, and state for thousands of Prometheus instances is not feasible at scale.
- Prometheus by design is a scale-up, monolithic architecture. There is no easy way to scale the query layer out or up independently of the storage layer. This was a big deal for us as we anticipated different throughput requirements for querying and storing.
- Prometheus has native integrations with Kubernetes and is built for the auto-discovery of nodes, Pods, Services, etc. Implementing a remote scrape protocol takes away this capability. We could have made this auto-discovery part of our custom protocol, but that would have required reimplementing a lot of built-in Prometheus functionality.
- Even though Prometheus is optimized to store data on disks, we wanted to use managed solutions for all of our storage needs on our platform.
Key Reasons We Chose Cortex
We started investigating other options that provide a multi-tenant, highly scalable solution for Prometheus, ones that could store data in managed databases like AWS DynamoDB and Bigtable instead of persistent volumes. We were aware of the Prometheus remote protocol and the new integrations that were rapidly being added in the community. As part of our investigation we found Cortex and its original design document for the aptly named Project Frankenstein. It was clear that Cortex solves problems similar to the ones we were facing, and that the architecture supported most (if not all) of our requirements.
The key reasons we chose Cortex were:
- Cortex relies on the Prometheus remote storage protocol. This means we can run Prometheus in our customers’ clusters as stateless applications and get the benefits of auto-discovering endpoints.
- Cortex integrates with various managed backends for the long term storage of data. We wanted to use AWS DynamoDB which is well supported.
- Cortex is designed to be a distributed system built on microservices principles, which makes it easy to scale out as needed. You can scale storage independently of querying.
- It has a small but vibrant open source community. Cortex originated at Weaveworks, and I had previously used other projects built in the open by Weaveworks, which gave us confidence in the quality and viability of the project.
- It is used in production by companies such as Weaveworks and Grafana Labs, which provide Prometheus as a service as part of their offerings.
Experience with Cortex
Our experience running and managing Cortex so far has been great, but not without a few hiccups. Because of its complexity, it is not easy to get started and quickly verify Cortex functionality. As I was struggling to understand the myriad of options and how to tune the performance for production traffic, I reached out to the community for help. One of the co-creators of Cortex had a call with me to share his best practices for running Cortex in production.
Here are a few steps that we went through that are key to running Cortex at scale:
- Create a Helm chart for Cortex, which makes it easy to iterate (verifying arguments, switching between development and production environments) and to upgrade in production as needed.
- Build an authentication proxy to validate the metrics coming in from various tenants and add the appropriate Cortex organization HTTP headers to support multi-tenancy.
- Identify and add alerts for various Cortex components so you can scale out as required.
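To make the authentication proxy step more concrete, here is a minimal sketch in Python of the header-rewriting logic such a proxy might perform. This is not our production proxy: the token table, the bearer-token scheme, and the tenant_headers function are hypothetical names chosen for illustration. The one Cortex-specific detail is the X-Scope-OrgID header, which Cortex reads to identify the tenant a request belongs to.

```python
from typing import Dict, Mapping, Optional

# Header Cortex reads to determine which tenant a request belongs to.
CORTEX_TENANT_HEADER = "X-Scope-OrgID"

# Hypothetical token-to-tenant table; a real proxy would likely
# consult an identity service or a signed credential instead.
TOKEN_TO_TENANT: Dict[str, str] = {
    "token-abc": "tenant-1",
    "token-xyz": "tenant-2",
}


def tenant_headers(incoming: Mapping[str, str]) -> Optional[Dict[str, str]]:
    """Return the headers to forward to Cortex, or None to reject the request."""
    # HTTP header names are case-insensitive, so look them up in lowercase.
    lowered = {name.lower(): value for name, value in incoming.items()}

    # Never forward the client's credentials or a client-supplied tenant
    # header; the tenant is decided by the proxy, not by the caller.
    blocked = {CORTEX_TENANT_HEADER.lower(), "authorization"}
    forwarded = {
        name: value
        for name, value in incoming.items()
        if name.lower() not in blocked
    }

    # Assumed scheme for this sketch: "Authorization: Bearer <token>".
    scheme, _, token = lowered.get("authorization", "").partition(" ")
    if scheme != "Bearer":
        return None

    tenant = TOKEN_TO_TENANT.get(token.strip())
    if tenant is None:
        return None

    forwarded[CORTEX_TENANT_HEADER] = tenant
    return forwarded
```

The important design choice is that any tenant header supplied by the client is dropped before the proxy sets its own, so one tenant cannot write metrics into another tenant’s namespace by spoofing the header.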
Cortex is now a CNCF sandbox project which will speed community growth and bring more structure to the project. There are various enhancements (some are already underway) that I am excited to see as they will help Aspen Mesh and likely other organizations using Cortex:
- Ability to run Cortex on a service mesh like Istio or Aspen Mesh for better control. I have tried this with partial success and would love to make this a configurable Helm option.
- Better organization of options and arguments allowed in various Cortex services. The community has done a great job of maintaining backwards compatibility by enabling feature flags, but the sheer number of options can be overwhelming for first-time users.
- Performance enhancements for the query path. Over the last few months a lot of work has gone into improving query performance, and I’m eager to try out the results.
- Strategy for performance isolation in the storage path (distributor, ingesters) for various tenants.
- Ability to efficiently scale out rulers used for generating alerts.
- Better documentation!
We at Aspen Mesh are excited to work with the open source community to help with these and other initiatives. If you have any questions on how to get started with Cortex or run it in production, I’m always happy to talk. You can find me at firstname.lastname@example.org.
Neeraj Poddar is the Platform Lead at Aspen Mesh. He has worked on various aspects of operating systems, networking and distributed systems over the span of his career. He is passionate about developing efficient and performant distributed applications. At Aspen Mesh, he is currently building an enterprise service mesh and their hosted SaaS platform. In his free time you can find him playing racquetball and gaining back the calories spent playing by trying out new restaurants.