On June 28th, 2017, Luke Marsden from Weaveworks gave a free online talk entitled “Observability beyond logging for Java Microservices”. In the talk, he discusses how understanding what’s happening in a Java microservices-based application running in a dynamic Kubernetes environment is challenging at the best of times, and how achieving true observability requires tools that are up to the job.
Luke provided an overview of the best solutions available for observability in a microservices environment. Used together, they help you diagnose, understand and troubleshoot distributed applications more effectively:
- Prometheus Monitoring and time-series metrics
- OpenTracing and the ability to trace a request all the way through a system
- Microservices visualization and troubleshooting
What do we mean by observability?
When it comes to monitoring multiple services in a cluster, most people are familiar with log aggregation. Distributed systems are frequently monitored by aggregating log output into a centralized logging system such as Loggly, or by shipping it with Logstash into Elasticsearch for analysis.
However, when you need to measure things like latency, error rates, and throughput, other types of tools are required: ones that let you observe these metrics from inside an application as it runs in production, rather than only from the outside.
According to Luke, “...observability is at least the sum of your logging infrastructure, your monitoring infrastructure, your tracing infrastructure and your visualization capabilities.”
Weave Cloud Development Lifecycle
Microservices observability is an integral part of Weave Cloud that helps developers ship code faster.
The lifecycle starts with a new feature idea that gets passed onto the development team. They write the code for it and merge it into Git.
The code for the new feature runs through your CI system, which builds a Docker container image and uploads it to a Docker container registry.
The Deploy feature of Weave Cloud takes your newly built image and deploys it safely and reversibly to your Kubernetes cluster. At the same time any configuration files that describe the service are automatically updated and those are checked into Git as well.
Once the new container has spun up in its environment, you can use Weave Cloud to verify that the service is running as intended. If there are problems, the code goes back through the loop again, and if it’s working as intended, the new service is pushed to production where it’s monitored and you’re alerted to issues with Prometheus.
Microservices and container orchestration
Docker containers lend themselves well to a microservices architecture. We won’t go into whether you should use microservices here. See “What are Microservices?” for more info.
Any non-trivial containerized application needs to run on more than one machine, and once that occurs, you need to answer questions such as, “how do I manage those containers across multiple machines?”
Container orchestrators like Kubernetes or Docker Swarm allow you to run an application across multiple machines without having to manually think about where each piece of that application should go. Since they also schedule your workloads onto different machines for you, they are also called ‘container schedulers’.
What’s common between different orchestrators is that they provide networking and also a way for different applications in different pods or containers to find and communicate with one another via Service Discovery.
“Getting your application running on one of these container orchestration frameworks is actually the easy bit,” says Luke.
Once it’s running there, the hard part comes when you have a problem with your application. How do you diagnose that problem, and how do you redeploy the fix to production, or ‘go back around the loop’, so that new features are delivered quickly to your users?
Prometheus is fast becoming the de facto standard for monitoring microservices in Kubernetes. Prometheus is a label-based time-series database: a database whose native data type is the time series.
What is a Time Series?
Each time series is identified by a set of key-value pairs (labels) that say key one has value a, key two has value b, etc. The series itself is a sequence of timestamped values: at time `t1` the value was `v1`, at `t2` the value was `v2`, and so on. These values are stored as 64-bit floating point numbers, and PromQL queries operate on vectors of them.
Prometheus client libraries allow you to instrument your code to capture the application metrics that you want to measure, using four metric types:
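The data model described above can be sketched in a few lines of plain Java. This is only an illustration of the idea, not how Prometheus stores data internally; the class names `Sample` and `TimeSeries` and the metric name used in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** One sample: a timestamp in milliseconds and a 64-bit floating point value. */
final class Sample {
    final long timestampMs;
    final double value;

    Sample(long timestampMs, double value) {
        this.timestampMs = timestampMs;
        this.value = value;
    }
}

/** A time series: a metric name plus key-value labels identifies a list of samples. */
final class TimeSeries {
    final String name;
    final Map<String, String> labels;              // sorted, so the identity is stable
    final List<Sample> samples = new ArrayList<>();

    TimeSeries(String name, Map<String, String> labels) {
        this.name = name;
        this.labels = new TreeMap<>(labels);
    }

    /** Records that at time t the value was v. */
    void append(long timestampMs, double value) {
        samples.add(new Sample(timestampMs, value));
    }

    /** Renders the identity in a name{key="value",...} form. */
    String id() {
        StringBuilder sb = new StringBuilder(name).append('{');
        String separator = "";
        for (Map.Entry<String, String> e : labels.entrySet()) {
            sb.append(separator)
              .append(e.getKey()).append("=\"").append(e.getValue()).append('"');
            separator = ",";
        }
        return sb.append('}').toString();
    }
}
```

For instance, a series created with the hypothetical name `http_requests_total` and the labels `service=orders` and `method=GET` is a different series from one with `service=cart`, even though both share the same metric name.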
- Counters -- values that only ever increase, such as the total number of requests served.
- Gauges -- values that can go up and down, such as current memory usage.
- Histograms -- sample observations, such as request durations, and count them in configurable buckets.
- Summaries -- sample observations and calculate quantiles over a sliding time window.
Once your code is instrumented with these metric types, PromQL, the Prometheus Query Language, is used to make sense of the data.
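To make the differences between the metric types concrete, here is a minimal plain-Java sketch of the first three. It deliberately does not use the Prometheus Java client library; the `Sketch*` class names are hypothetical, and summaries are left out because quantile estimation would make the example much longer.

```java
import java.util.concurrent.atomic.DoubleAdder;

/** Counter: a value that only ever increases (for example, requests served). */
class SketchCounter {
    private final DoubleAdder total = new DoubleAdder();

    void inc() { inc(1.0); }

    void inc(double amount) {
        if (amount < 0) {
            throw new IllegalArgumentException("counters cannot decrease");
        }
        total.add(amount);
    }

    double get() { return total.sum(); }
}

/** Gauge: a value that can go up and down (for example, requests in flight). */
class SketchGauge {
    private double value;

    synchronized void set(double v) { value = v; }
    synchronized void inc() { value += 1; }
    synchronized void dec() { value -= 1; }
    synchronized double get() { return value; }
}

/** Histogram: counts observations into cumulative buckets, plus a sum and a count. */
class SketchHistogram {
    private final double[] upperBounds;  // ascending bucket upper bounds
    private final long[] bucketCounts;   // cumulative: observations <= upperBounds[i]
    private long count;
    private double sum;

    SketchHistogram(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        this.bucketCounts = new long[upperBounds.length];
    }

    synchronized void observe(double v) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (v <= upperBounds[i]) {
                bucketCounts[i]++;
            }
        }
        count++;
        sum += v;
    }

    synchronized long bucketCount(int i) { return bucketCounts[i]; }
    synchronized long count() { return count; }
    synchronized double sum() { return sum; }
}
```

A real client library additionally exposes these values over HTTP so that the Prometheus server can scrape them; the sketch only models the in-process bookkeeping.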
White Box vs. Black Box Monitoring
Prometheus instrumentation can be used to set up both white and black box monitoring. Luke discussed the differences between these types of monitoring.
A black box is something you can’t see inside and whose inner workings you don’t understand. You can therefore only ‘poke’ at it from the outside. White box monitoring treats the system more like a transparent box: you can see inside it and how it works.
An example of a black box type of instrumentation with Prometheus is checking the CPU usage of your machines by using a Gauge.
White box monitoring is concerned with what is occurring inside your application. For example, in an e-commerce application, you would measure the number of successful orders placed versus the number of orders that were dropped.
In Luke’s example, a histogram from the Prometheus Java client library was declared to monitor HTTP request latency.
A powerful Prometheus feature is its use of labels. In this example, labels were applied to the metric for all of the services running in the application. This enables a single PromQL query to return the request durations of all services in one graph.
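The effect of labelling one metric by service can be sketched in plain Java as follows. This is an assumption-laden illustration rather than the code from the talk: the class `LabelledDurations`, its methods, and the service names in the usage note are all hypothetical, and the grouping is only analogous to what a PromQL query grouped by a `service` label would return.

```java
import java.util.Map;
import java.util.TreeMap;

/** A request-duration metric with one child (sum and count) per "service" label value. */
class LabelledDurations {
    private static final class Child {
        double sumSeconds;
        long count;
    }

    private final Map<String, Child> children = new TreeMap<>();

    /** Records one request duration under the given service label. */
    synchronized void observe(String service, double seconds) {
        Child child = children.computeIfAbsent(service, k -> new Child());
        child.sumSeconds += seconds;
        child.count++;
    }

    /** Mean duration per service: one line per label value on a shared graph. */
    synchronized Map<String, Double> meanByService() {
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Child> e : children.entrySet()) {
            Child child = e.getValue();
            result.put(e.getKey(), child.sumSeconds / child.count);
        }
        return result;
    }
}
```

Because every service records into the same metric and differs only by label value, adding a new service (say a hypothetical `payments`) requires no new metric and no change to the query.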
What isn’t immediately obvious from Prometheus, however, is how those services or components relate to one another, or why a particular request was slow. In addition, if you only looked at the aggregated log output of all the different components in your system, it would be difficult to determine the exact event that caused the latency.
Tracing provides the answers to “where that time was spent” during a distributed request. Tracing can be conceptualized as a map of causality through multiple layers of microservices.
A trace records the flow of events through your system as a request is handled. OpenTracing is a vendor-neutral standard for tracing with its own set of APIs.
Tracing is instrumented in your code with `spans`, which together form a tree of events covering the entire path of a trace. Spans can be annotated with timing information and messages, and they record when services call each other via RPC (Remote Procedure Call). The trace context is injected into HTTP headers between services and propagated internally in ‘contexts’, so that trace-instrumented code can report back to the tracing server, which reassembles the causality.
Messages added to your spans can be read together with your aggregated logs to provide a richer form of logging and a deeper understanding of that mysterious 500 error.
Luke described several tracing systems that support the OpenTracing standard APIs, such as Zipkin, Appdash, and LightStep, so that causality can be better understood when investigating problems caught with Prometheus metrics and aggregated logs.
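The span tree and header propagation described above can be sketched in plain Java. This is not the OpenTracing API: the class `SketchSpan`, its methods, and the `x-sketch-*` header names are hypothetical and chosen only to show how a parent-child link survives a hop between two services.

```java
import java.util.Map;
import java.util.UUID;

/** A span: one timed unit of work that knows its trace and its parent span. */
class SketchSpan {
    // Hypothetical header names; real tracers define their own formats.
    static final String TRACE_HEADER = "x-sketch-trace-id";
    static final String SPAN_HEADER = "x-sketch-span-id";

    final String traceId;        // shared by every span in one request
    final String spanId = newId();
    final String parentSpanId;   // null for the root span
    final String operation;
    final long startNanos = System.nanoTime();
    long durationNanos = -1;     // set by finish()

    private SketchSpan(String traceId, String parentSpanId, String operation) {
        this.traceId = traceId;
        this.parentSpanId = parentSpanId;
        this.operation = operation;
    }

    /** Starts a new trace. */
    static SketchSpan root(String operation) {
        return new SketchSpan(newId(), null, operation);
    }

    /** Starts a child span inside the same process. */
    SketchSpan child(String operation) {
        return new SketchSpan(traceId, spanId, operation);
    }

    /** Client side: writes the trace context into outgoing HTTP headers. */
    void inject(Map<String, String> headers) {
        headers.put(TRACE_HEADER, traceId);
        headers.put(SPAN_HEADER, spanId);
    }

    /** Server side: continues the caller's trace, or starts a new one if absent. */
    static SketchSpan extract(Map<String, String> headers, String operation) {
        String traceId = headers.get(TRACE_HEADER);
        if (traceId == null) {
            return root(operation);
        }
        return new SketchSpan(traceId, headers.get(SPAN_HEADER), operation);
    }

    /** Records how long the span took; a real tracer would also report it. */
    void finish() {
        durationNanos = System.nanoTime() - startNanos;
    }

    private static String newId() {
        return UUID.randomUUID().toString();
    }
}
```

With every span carrying a `traceId` and a `parentSpanId`, a tracing server that receives the finished spans from all services has enough information to rebuild the tree and show where the time was spent.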
The third tool in your arsenal for monitoring distributed systems is visualization. Weave Cloud provides this: with it you can test egress and ingress on a system-wide level, understand interactions between components, and then drill down into the details at the container level. A handy terminal is also provided that lets you investigate your container at the OS level.
When running microservices in a distributed environment, identifying the root cause of problems requires a more rigorous approach than what log aggregation provides on its own, and is best handled through a multi-pronged approach:
- Prometheus Monitoring for identifying top-level events as they occur
- OpenTracing for identifying causality
- Microservices visualization in Weave Cloud for troubleshooting
Check out the Weave Cloud documentation for information on Exploring & Troubleshooting.
To see the talk in its entirety, watch the video here: