Kubernetes Monitoring with Prometheus
This page is for Prometheus beginners and experts alike. It may help if you:
- Have heard about Prometheus and want to try it now
- Already use Prometheus and have production questions about high availability (HA) and data storage
- Need to get the most out of Prometheus without a steep learning curve, and want to implement your own Kubernetes monitoring strategy from development to production
Why Weaveworks for Prometheus?
At Weaveworks we believe Prometheus is the “must have” monitoring and alerting tool for Kubernetes and Docker. It provides by far the most detailed and actionable metrics and analysis, and it performs well under heavy loads and bursts. In addition to this, you get all of the benefits of a world-leading open source project. Prometheus is free at the point of use and covers many use cases with ease. At Weaveworks, we see it becoming a standard software tool for cloud native applications:
- Weave Cloud provides a SaaS version of Prometheus where you can take advantage of multi-tenancy and scalable long-term data storage for your existing deployments.
- Weaveworks uses Prometheus in production for all of our own Kubernetes clusters and for our CI pipelines.
- We are active in the open source community and have created an open source storage engine for Prometheus called Weave Cortex, which is the backend for Weave Cloud.
Why choose Prometheus?
With the huge uptake of containers and microservices, monitoring solutions have to manage more services and servers than ever before. Not only are there more objects to manage, but cloud native apps also generate far more data that you need to keep track of. Collecting data from an environment composed of so many moving parts is complex, and Prometheus is the best modern solution for these dynamic cloud environments. It was built specifically to monitor applications and microservices running in containers at scale, and it is native to containerized environments. Originally developed at SoundCloud, a pioneer in the adoption of cloud technology, Prometheus was inspired by Borgmon, a monitoring system developed at Google. Its key features include:
- Kubernetes integration: supports service discovery and monitoring of dynamically scheduled services.
- Flexible multi-dimensional data model: a label-based time-series database. You can query this data with the PromQL language, which lets you diagnose a problem from the data recorded when it occurred, without having to recreate the issue outside of the system after the event.
- Built-in Alertmanager: sends out notifications via a number of methods based on rules that you specify. This not only eliminates the need to source an external system and API, but it also reduces interruptions for your development team.
- Supports whitebox and blackbox monitoring: provides extensive instrumentation client libraries and exporters to support both. Whitebox monitoring uses metrics exposed by the internals of the system, such as logs, interfaces, or HTTP handlers that emit internal statistics. Blackbox monitoring is about externally visible behavior that affects your users, such as a login service going down or your site’s performance.
- Pull-based metrics: a pull-based monitoring system means that your services don’t have to know where your monitoring system is located. You simply expose your metrics on an HTTP endpoint and Prometheus pulls the metrics from it.
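The pull model in the last item can be illustrated with a minimal sketch that uses only the Python standard library. It serves two made-up metrics on a `/metrics` path as plain text, one sample per line, and then fetches them once the way a scraper would. The metric names, values, and the `scrape_once` helper are assumptions for illustration; a real service would normally use an official Prometheus client library rather than hand-rolling the output.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric values; a real service would update these as it runs.
METRICS = {"demo_requests_total": 42, "demo_queue_length": 3}


def render_metrics(metrics):
    """Render name/value pairs as plain text, one sample per line."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def scrape_once():
    """Start the endpoint on a free local port, pull it once, return the text."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/metrics"
        # Bypass any proxy settings so the request stays on localhost.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(scrape_once())
```

Note that the service never needs the address of the monitoring system: it only answers requests, and whoever knows the endpoint decides when to collect.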
More about why you need Prometheus
OSS Prometheus challenges
Because Prometheus is a new and powerful cloud native technology, development and operations teams may initially be unfamiliar with it, so there will be a learning curve for both groups, who need to work together.
Prometheus is capable of monitoring much more than your infrastructure; it can also analyze long-term trends in your software as it runs in production and help debug performance problems. But these capabilities come with challenges. For example, true observability may not be achieved by using Prometheus alone.
If you are hosting your own Prometheus solution, an “instrument everything” approach can result in server overload. Your teams therefore need to decide which metrics are important. If you use Weave Cloud, where we host the metrics, you are free to instrument everything.
Instrumentation can be divided into two stages:
- Data Exporters: the OSS community for Prometheus provides exporters for many standard services like MySQL and Redis, where metrics are exported automatically once you’ve installed Prometheus or signed up for Weave Cloud.
- Client Libraries: these are libraries that help you create custom metrics for Prometheus to scrape.
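To give a feel for what a client library does for you, here is a simplified sketch of a labelled counter that renders itself as text for scraping. The `LabelledCounter` class and the metric name are hypothetical and written for this page; this is not the API of any official Prometheus client library, and it skips details such as escaping special characters in label values.

```python
from collections import defaultdict


class LabelledCounter:
    """A counter that only increases, with one time series per label set."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only increase")
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] += amount

    def render(self):
        """Return HELP and TYPE comment lines followed by one line per label set."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key in sorted(self._values):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {self._values[key]}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    requests = LabelledCounter(
        "http_requests_total", "Total HTTP requests.", ["method", "code"]
    )
    requests.inc(method="GET", code="200")
    requests.inc(method="GET", code="200")
    requests.inc(method="POST", code="500")
    print(requests.render())
```

Each distinct combination of label values becomes its own time series, which is what makes the data model multi-dimensional.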
And if you’re unsure about which metrics are important, the RED methodology may help you decide. The RED method is a way to instrument your code for metrics that have the greatest impact on your users and ultimately your ROI. The methodology can be used in conjunction with traditional infrastructure monitoring to give you a complete picture of your app’s performance in a cloud native environment.
- The RED Method: key metrics for microservices architecture is a blog post on how to select metrics for your app.
- The USE Method provides an overview on how to instrument for resources and infrastructure.
- Prometheus Supported Exporters is a list of supported exporters and integrations.
- Default Port Allocations is a list of exporters developed outside of the Prometheus project.
- Prometheus Exporters for Amazon CloudWatch is a GitHub repository for Amazon CloudWatch.
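RED stands for Rate, Errors, and Duration. As a rough sketch of what those three numbers mean for a single service, the following computes them from a batch of request records observed in a time window. The `RequestRecord` type and `red_summary` function are hypothetical names, and treating HTTP 5xx responses as errors and using the median as the duration summary are assumptions made for this example.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_seconds: float
    status_code: int


def red_summary(records, window_seconds):
    """Summarize requests seen in a window as rate, error ratio, and duration."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not records:
        return {"rate": 0.0, "error_ratio": 0.0, "median_duration": None}
    # Assumption: a 5xx status code counts as an error.
    errors = sum(1 for r in records if r.status_code >= 500)
    return {
        "rate": len(records) / window_seconds,       # requests per second
        "error_ratio": errors / len(records),        # fraction of failed requests
        "median_duration": statistics.median(
            r.duration_seconds for r in records
        ),
    }


if __name__ == "__main__":
    sample = [
        RequestRecord(0.1, 200),
        RequestRecord(0.2, 200),
        RequestRecord(0.3, 500),
        RequestRecord(0.4, 200),
    ]
    print(red_summary(sample, window_seconds=60))
```

In a Prometheus setup you would usually derive these values with PromQL from counters and histograms rather than from raw request records; the sketch only shows the arithmetic behind the three signals.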
Observability vs. Monitoring
A popular topic lately is the difference between observability and monitoring. There are several really good blogs on this topic, and a section on this page can’t really do it justice.
Briefly, observability is what you are trying to achieve when you monitor a distributed app. When monitoring multiple services in a cluster, most people are familiar with log aggregation. Distributed systems are frequently monitored by aggregating log output into a centralized logging system such as Loggly, Elasticsearch with Logstash, or the CNCF project Fluentd. But when you need to measure things like latency, error rates, and throughput, other types of tools are required: ones that allow you to observe these metrics in an application as it’s running in production. This is when you might consider pairing Prometheus with a tracing tool like Jaeger, or with the OpenTracing API, so that you can achieve true observability of your distributed app.
Observability can then be defined as at least the sum of your logging infrastructure, your monitoring infrastructure, your tracing infrastructure and your visualization capabilities.
Prometheus is a pull-based system that runs as a single, standalone server, with no built-in clustering or replication between instances. While it is relatively straightforward to set up a single Prometheus instance, that instance lives on one host, so monitoring can become cumbersome to scale as your application grows more complex.
A single server is also a single point of failure: if Prometheus fails on that host, you’ll have to recover its data manually.
Data Storage & Migration
By itself, Prometheus does not offer authentication or access control for the data in the monitoring system. You must build it yourself or provide a third-party alternative. Also, there is no built-in long-term storage service for your monitoring data. By default, all data resides on a local disk. To store data externally, you must integrate with a third-party service.
And if you are changing over from an existing monitoring system, you’ll need to migrate any existing data into Prometheus. Getting a handle on the exporters to do that can be particularly challenging. (The latest version of Prometheus may address this with a new exporter architecture.)
Alerting and High Availability
In production, managing alerts and creating graphs with PromQL requires some planning if you are using the open source UI. You may want to display metrics on larger dashboards instead (Grafana is a common open source choice), or you can correlate them with application management and logs using an open source map and console tool like Weave Scope. Finally, the need for aggregated long-term storage and HA may lead users to set up multiple instances and have them share a more reliable storage management service.
You can use Weave Cloud to host and scale your Prometheus instances as well as store your metric data; it comes with its own set of canned dashboards, but we also provide a Grafana Library (written in Python) if you prefer to use Grafana. Setting up alerts in Weave Cloud is a more streamlined user experience where rules and notifications are configured from within the Weave Cloud GUI.
Managed hosted Prometheus service with Weave Cloud
Weave Cloud, our SaaS product, simplifies Ops for developers and it includes a hosted and managed version of Prometheus for easy container monitoring.
Long term storage plus HA monitoring and alerting
Weave Cloud is a great choice for Prometheus users who want to instrument production systems. Weave Cloud is a highly available, scalable system that builds in multi-tenancy and orchestration to minimize the risk of downtime, and offers long-term data storage for stability and scale.
For users who want a local Prometheus, Weave Cloud’s data store can be used remotely. This removes the risk of the local Prometheus becoming overwhelmed or being a point of failure. Weave Cloud’s hosted Alertmanager is HA and eliminates the need to configure multiple instances or a mesh network in order to achieve HA by hand.
Easy to use GUI for alerting and team collaboration
Monitoring distributed systems is a collaboration between development and operations that requires specialized workflows. While open source Prometheus does provide some ‘out of the box’ functionality for cross-team collaboration when running in Kubernetes, many users would also like assistance with code instrumentation, constructing PromQL queries, and information on how to integrate Grafana.
Weave Cloud provides an enhanced Prometheus interface that is easy to use and, with the addition of shared notebooks, allows teams to drill down on problems as they occur. You can use Weave Cloud to set up and configure rules against thresholds that, when met, send alerts to Slack, email, or a browser so that your team can act quickly on the data presented.
If you prefer, Grafana may also be integrated as your dashboard for your local and/or long term data as needed.
Instrument everything without a performance hit
Hosted Prometheus also means that you can go ahead and instrument everything in your code without worrying about I/O performance. This means you can scale up to multiple teams and clusters.
Out-of-the-box metrics and visualization
With Weave Cloud, much of your infrastructure is already instrumented and can be instantly viewed on a number of preconfigured charts. You can view the resource usage of your Kubernetes cluster by namespace and also step backward through time with the time travel feature to pinpoint any problems.
Deploy new features with confidence
Before deploying new features, use the integrated deployment dashboards to create pre-flight checks. Define your charts in the deployment dashboard and then check that your new code is operating as intended against a running cluster with the dry run feature.
Getting Started with Prometheus in Development
Depending on your requirements, there are two ways that you can get started with Prometheus. You can install it yourself in your own data centre or you can use Weave Cloud to host both Prometheus and its metrics for you.
After you’ve installed Prometheus and thought about which exporters you need, the next steps involve instrumenting your application, and then using PromQL to visualize metrics and to specify alert thresholds.
Here’s a set of links that will help you with those topics:
Graphing & Querying Metrics in Prometheus
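To make the idea of an alert threshold mentioned above more concrete, here is a small sketch of the logic behind a threshold rule with a hold duration: the alert is pending while the value is above the threshold and only fires once it has stayed there for long enough. The `evaluate_alert` function, the state names as return values, and the sample data are assumptions for illustration. In Prometheus itself you would express this as an alerting rule with a PromQL expression rather than in application code.

```python
def evaluate_alert(samples, threshold, for_seconds):
    """Return the alert state after processing all samples.

    samples is a list of (timestamp_seconds, value) pairs sorted by time.
    The state is "inactive" while the latest value is at or below the
    threshold, "pending" while it has been above the threshold for less
    than for_seconds, and "firing" once it has been above it that long.
    """
    breach_start = None
    state = "inactive"
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of this breach
            held_for = timestamp - breach_start
            state = "firing" if held_for >= for_seconds else "pending"
        else:
            breach_start = None  # any recovery resets the hold timer
            state = "inactive"
    return state


if __name__ == "__main__":
    # Hypothetical error ratios sampled every 15 seconds.
    error_ratio = [(0, 0.01), (15, 0.20), (30, 0.30), (45, 0.25)]
    print(evaluate_alert(error_ratio, threshold=0.1, for_seconds=30))
```

The hold duration is the design choice worth noticing: it trades a slightly later notification for fewer alerts caused by short spikes.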
Getting help with Prometheus
To get help with Prometheus and to learn how to contribute to the open source version of Prometheus, see the Prometheus Community page.
Download our latest whitepaper, "Monitoring Cloud Native Applications", to learn about Prometheus and the different methodologies, metrics, and approaches for effectively monitoring microservices.