Cloud-Native Logging And Monitoring Pattern
The Need For Observability in The Cloud-Native World
Any system that runs without exposing the necessary information about its health, the status of the applications running inside it, and whether or not something has gone wrong is simply useless. But this has been said many times before, right? Since the early days of mainframes and punch-card programming, there have always been gauges for monitoring the system’s temperature and power, among other things.
The important aspect of those old systems is that they rarely changed. If a modification was to occur, monitoring was included in the plan, so it was not a problem. Additionally, the system operated as a whole, which made it easier to observe; there weren’t many moving parts. But in the cloud-native world that we live in today, change is anything but rare; it’s the norm. The microservices pattern that cloud-native apps follow makes the system a highly distributed one. As a result, metric collection, monitoring, and logging need to be approached differently. Let’s have a brief look at some of the monitoring challenges in a cloud-native environment:
- Assume that you run your application in a Kubernetes cluster. Kubernetes uses pods as its workhorses; they run the application containers. But pods are ephemeral by nature: they get deleted, recreated, restarted, and moved from one node to another. So, if the application is misbehaving, where would you go to find clues when the runtime environment is ever-changing?
- Due to the highly distributed nature of cloud-native apps, it becomes increasingly hard to trace a request that went through a dozen services to fulfill a client’s need. If the client receives a wrong response or an error message, which component would you suspect as the probable cause?
Cloud-Native Logging
As a rule of thumb, logs should be collected and stored outside the node that’s hosting the application. The reason is that you need a single place where you can view, analyze, and correlate logs from different sources. In highly distributed environments, this need becomes even more important as the number of services (and logs) increases. There are many tools that do that task for you; for example, Filebeat, Logstash, and Log4j, among others. The application should log important events as they happen, but it should not decide where the logs go; that’s the deployment’s decision. For example, a Python application may have the following line to record that a new user was added to the database:
log.info("{} was added successfully".format(user))
Of course, the application needs to know where this line of text should go. Since we’re leaving this decision to the environment, we instruct the application to send it to STDOUT (or STDERR, if the log describes an error event). There are many reasons why you should follow this practice:
- If your environments are ephemeral (for example, Kubernetes pods), having the container tightly coupled to its logs deprives you of viewing those logs when the environment is gone (crashed, deleted, etc.)
- In a microservices architecture, it’s perfectly common to have a backend written in PHP, the middleware built with Python, and the frontend using NodeJS. Each of those systems may be using a framework that has its own way of logging, with its own file locations. Directing logs to STDOUT and STDERR guarantees that we have a consistent way of collecting logs.
- The cloud-native architecture is so flexible that not only can you use different programming languages for different components, but you can also use dissimilar operating systems. For example, the backend may be a legacy system that only works on Windows machines, while the rest of the infrastructure runs on Linux. But since STDOUT and STDERR are inherent aspects of any OS by design, using them as uniform log sources provides even more consistency.
- STDOUT and STDERR are standard streams, so they fit the nature of logs perfectly: a log is a stream of data with no start or end; it just produces strings of text as events occur. Hence, it’s easier to initiate API calls to the log-aggregation service from a stream than from a file.
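To tie the reasons above back to the Python line from earlier, here is a minimal sketch of how the standard logging module could be wired so that informational events go to STDOUT and errors to STDERR. The handler setup, log format, and user value are assumptions for illustration, not something the pattern mandates:

import logging
import sys

# Informational events go to STDOUT; errors and above go to STDERR.
# Where those streams end up is left entirely to the environment.
stdout_handler = logging.StreamHandler(sys.stdout)
stdout_handler.addFilter(lambda record: record.levelno < logging.ERROR)

stderr_handler = logging.StreamHandler(sys.stderr)
stderr_handler.setLevel(logging.ERROR)

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[stdout_handler, stderr_handler],
)

log = logging.getLogger(__name__)
user = "jane.doe"  # hypothetical user, for illustration only
log.info("{} was added successfully".format(user))            # written to STDOUT
log.error("failed to add {} to the database".format(user))    # written to STDERR

The application never names a file or an aggregator; the runtime environment decides what happens to the two streams.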
Test Case: Kubernetes Logging
Assume that your application is hosted on three Pods in a Kubernetes cluster. The application is configured to print a log line whenever it receives and responds to a web request. However, the cluster is designed so that a Service routes each incoming request to a random backend Pod. Now, the question is: when we want to view the application logs, which Pod should we query? The following diagram depicts this situation:
Because this scenario is common, Kubernetes allows you to view the collective logs of all the Pods that match a specific label by passing the -l label selector to kubectl logs (for example, kubectl logs -l app=frontend). First, let’s list the Pods:
$ kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
frontend-5989794999-6wxhm   1/1     Running   0          2d4h
frontend-5989794999-fkq56   1/1     Running   0          124m
frontend-5989794999-nzn2f   1/1     Running   0          124m
As expected, kubectl logs -l app=frontend returns a verbose output, because it combines the logs of all the Pods labeled app=frontend. Each log entry contains the IP address of the container followed by the date and time of the request, the HTTP verb, and other data; the other two Pods produce similar output but with different IP addresses. The application used in this example is a Python Flask app, and the format of the log message is not dictated by Kubernetes; it is simply how Flask logs its events. In a real-world application, you should configure the application to include the container’s hostname when logging events; the container’s hostname can be set whenever it starts.
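The article doesn’t show the exact configuration, but as a hedged sketch, including the container’s hostname (which Kubernetes sets to the Pod name by default) in every log record could look like this:

import logging
import socket

# Kubernetes sets the container's hostname to the Pod name by default,
# so embedding it in the log format ties each entry to a specific Pod.
hostname = socket.gethostname()

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s " + hostname + " %(levelname)s %(message)s",
)

logging.getLogger(__name__).info("responded to GET / with 200")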
Cloud-Native Metrics
There is an important difference between logs and metrics. Logs describe events as they happen. Some events are of critical importance, while others are of decreasing importance down to the debug level, which is the most verbose. Metrics, on the other hand, describe the current state of the application and its infrastructure. You can set alarms that fire when certain metrics reach a predefined threshold; for example, when CPU utilization crosses 80% or when the number of concurrent web requests reaches 1,000.
You should implement the necessary logic for exposing application metrics in your code. Most frameworks already have that implemented anyway, but you can (and should) extend the core functionality with your own custom metrics whenever they’re necessary.
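The article doesn’t prescribe a specific library, but as one hedged illustration, a custom metric could be added with the Prometheus Python client, which exposes everything it registers over a small HTTP endpoint:

from prometheus_client import Counter, start_http_server

# Hypothetical custom metric: how many users were added to the database.
USERS_ADDED = Counter("users_added_total", "Number of users added")

# Expose all registered metrics on port 8000 for the metrics server to pull.
start_http_server(8000)

def add_user(user):
    # ... persist the user, then record the event as a custom metric.
    USERS_ADDED.inc()

The metric name, port, and helper function are illustrative assumptions; the point is that custom metrics sit alongside the framework’s built-in ones with very little code.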
When obtaining metrics, we basically have two ways of doing it: the pull-based and the push-based method. Let’s have a look at each:
The Pull-Based Method
In this approach, the application exposes one or more health endpoints. Hitting those URLs through an HTTP request returns health and metrics data for that application. For example, our Flask API may have a /health endpoint that exposes important information about the current status of the service. However, implementing this approach comes with a challenge:
Which Pod are You Pulling From?
In a microservices architecture, more than one component is responsible for hosting a part of the application, and each service has more than one Pod for high availability. For example, a blogging app may have an authentication service, a posts service, and a comments service. Each service is typically behind a load balancer that distributes the load among its replicas. Your metrics-collection controller hits the health endpoints of the application and stores the results in a time-series database (for example, InfluxDB or Prometheus). So, when the controller needs to pull a service for its metrics, it will probably hit the URL of the load balancer rather than the Pods behind it. The load balancer routes the request to the first available backend Pod, so the response covers the health of only that Pod, not all of them. The following illustration demonstrates this situation:
In addition, since a different pod may reply to the health check request, we need a mechanism by which each pod can identify itself in the response (a hostname, an IP address, etc.)
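A minimal, hypothetical sketch of a Flask /health endpoint that identifies the responding Pod might look like this (the response fields are assumptions, not a prescribed format):

import socket
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/health")
def health():
    # Report which Pod answered, so pulled metrics can be attributed correctly.
    # A real endpoint would also check database connections, queues, and so on.
    return jsonify(status="ok", hostname=socket.gethostname())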
One possible solution to this problem is to make the monitoring system responsible for detecting the URL(s) of the different services in the application instead of relying on the load balancer; in other words, moving service discovery to the client side. The controller here periodically performs two tasks:
- Discovering which services are available. One way of doing this in Kubernetes is by placing the Pods behind a headless Service that returns the URLs of the individual Pods.
- Pulling the health metrics of the Pods by hitting the respective URL of each discovered Pod.
The following illustration depicts this approach:
With this approach, the monitoring system must have access to the internal IP addresses of the pods. Thus, it must be deployed as part of the cluster. Popular solutions like Prometheus offer operators that provide deep integration not only with the pods running inside the cluster but also with the system-level metrics from the cluster itself.
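As a rough sketch of the two steps above, the controller could resolve a headless Service to the individual Pod IPs and then pull each /health endpoint. The Service name, port, and use of the requests library are assumptions for illustration:

import socket
import requests

# Hypothetical headless Service; resolving it returns every Pod IP behind it.
SERVICE = "frontend-headless.default.svc.cluster.local"

def discover_pods(service=SERVICE):
    # Step 1: discover the individual Pods behind the headless Service.
    infos = socket.getaddrinfo(service, 80, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

def pull_metrics():
    # Step 2: pull the health metrics of each discovered Pod directly.
    for ip in discover_pods():
        # Port 5000 is assumed here (Flask's default).
        response = requests.get(f"http://{ip}:5000/health", timeout=2)
        print(ip, response.json())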
The Push-Based Method
In this approach, the application is responsible for finding the address of the metrics server and pushing the data to it. At first, this method may seem to add a level of complexity to the application, since developers have to build the necessary modules for metric collection and pushing. However, this can be avoided entirely by using the sidecar pattern.
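Before turning to the sidecar, here is a rough sketch of what the direct, in-application push could look like, using the Prometheus Pushgateway client as one assumed option (the gateway address, metric, and job name are made up for illustration):

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# The application itself has to know the metrics server's address.
PUSHGATEWAY = "metrics-gateway.monitoring.svc.cluster.local:9091"  # assumed

registry = CollectorRegistry()
inflight = Gauge("inflight_requests", "Concurrent web requests", registry=registry)
inflight.set(42)  # illustrative value

# Push the collected metrics; this coupling is what the sidecar pattern removes.
push_to_gateway(PUSHGATEWAY, job="frontend", registry=registry)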
The sidecar pattern makes use of the fact that a Pod can host more than one container, all of which share the same network address and volumes and can communicate with each other through the loopback address, localhost. Accordingly, we can add a container that is solely responsible for collecting logs and/or metrics and pushing them to the metrics server or a log aggregator.
Using this pattern, you don’t need to make changes to the application, since the sidecar container already collects the necessary metrics through the health endpoint and sends them over to the server. If the server’s IP address or type changes, we only need to change the implementation of the sidecar container; the application remains intact. The following figure demonstrates using the sidecar container (also called Ambassador or Envoy) to collect and push metrics:
As a bonus, this pattern also gives you the chance to expose more metrics than the application does on its own. This is possible because the sidecar container itself can expose additional data about the application’s performance; for example, the response latency, the number of failed requests, and more.
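A sidecar’s collection loop can stay very small. The sketch below assumes the application container exposes /health on localhost port 5000 and that the metrics server accepts a plain HTTP POST; both are illustrative assumptions rather than details from the pattern itself:

import time
import requests

APP_HEALTH_URL = "http://localhost:5000/health"                   # same Pod, via loopback
METRICS_SERVER = "http://metrics-server.monitoring:8080/ingest"   # assumed endpoint

while True:
    # Scrape the application through the shared loopback interface.
    response = requests.get(APP_HEALTH_URL, timeout=2)
    metrics = response.json()
    # The sidecar can add data the application doesn't expose itself,
    # such as the response latency of the scrape.
    metrics["scrape_latency_ms"] = response.elapsed.total_seconds() * 1000
    # Push everything to the metrics server on the application's behalf.
    requests.post(METRICS_SERVER, json=metrics, timeout=2)
    time.sleep(15)

If the metrics server moves or its protocol changes, only this loop changes; the application container is untouched.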
TL;DR
- Being aware of how your application is behaving even before problems start to happen is not a new concept; the practice has been followed since the early days of computing.
- Even in non-cloud-native environments, metrics must be stored outside the node that’s producing them for better visibility, correlation, and aggregation.
- In a microservices architecture, there are many challenges to log collection because of the many log and metric sources.
- Orchestration systems like Kubernetes have their own ways of dealing with multiple data sources. For example, using the -l argument with kubectl logs, you can collect the output of all the Pods that share the same label.
- It’s much more efficient to broadcast logs to the standard output and standard error channels rather than writing them to local files. Later on, log-collection agents can deal with STDOUT and STDERR easily and more consistently across multiple operating systems/applications.
- When it comes to metric collection, we basically have two ways of doing it in a cloud-native environment: pull-based and push-based.
- The pull-based method relies on the metric controller to discover and communicate with the appropriate service. It gets the metrics by hitting the health endpoint designated for this purpose.
- The push-based method depends on the service to determine the endpoint of the metrics server. To avoid adding overhead on the application, we use a sidecar container to collect and send the metrics to the server on the application’s behalf.