Kubernetes Observability: Log Aggregation Using ELK Stack
Logging, when planned for in the earliest design stages, helps you diagnose bugs, gain insight into system behavior, and spot potential issues before they occur.
Logging In The Cloud-Native Age
In the old days, all components of your infrastructure were well-defined and well-documented. For example, a typical web application could be hosted on a web server and a database server. Each component saved its own logs in a well-known location: /var/log/apache2/access.log, /var/log/apache2/error.log and mysql.log.
Back then, it was very easy to identify which logs belonged to which server. Even in a relatively complex environment, say four web servers and two clustered database engines, you could still trace every log file back to its source.
Let’s fast forward to the present day where terms like cloud providers, microservices architecture, containers, ephemeral environments, etc. are part of our everyday life. In an infrastructure that’s hosted on a container orchestration system like Kubernetes, how can you collect logs? The highly complex environment that we mentioned earlier could have dozens of pods for the frontend part, several for the middleware, and a number of StatefulSets. We need a central location where logs are saved, analyzed, and correlated. Since we’ll be having different types of logs from different sources, we need this system to be able to store them in a unified format that makes them easily searchable.
Logging Patterns In Kubernetes
Now that we have discussed how logging should be done in cloud-native environments, let’s have a look at the different patterns Kubernetes uses to generate logs.
The Quick Way To Obtain Logs
By default, any text that a pod writes to standard output (STDOUT) or standard error (STDERR) can be viewed with the kubectl logs command. Consider the following pod definition:
apiVersion: v1
kind: Pod
metadata:
  name: counter
spec:
  containers:
  - name: count
    image: busybox
    args: [/bin/sh, -c,
      'i=0; while true; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done']
This pod uses the busybox image to print the current date and time every second indefinitely. Let’s apply this definition using kubectl apply -f pod.yml. Once the pod is running, we can grab its logs as follows:
$ kubectl logs counter
0: Sat Nov 2 08:46:40 UTC 2019
1: Sat Nov 2 08:46:41 UTC 2019
2: Sat Nov 2 08:46:42 UTC 2019
3: Sat Nov 2 08:46:43 UTC 2019
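Before moving on to aggregation, note that kubectl logs supports a few flags that make quick debugging easier. A brief sampling of standard kubectl options (the app=web label used in the last command is a hypothetical example):

# Stream the logs live, similar to tail -f
$ kubectl logs -f counter

# Show the logs of the previous container instance (useful after a crash or restart)
$ kubectl logs counter --previous

# Show only the last 20 lines, and only entries from the last 5 minutes
$ kubectl logs counter --tail=20 --since=5m

# Show logs from every pod that carries a given label
$ kubectl logs -l app=web --all-containers=true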
Aggregating Logs
The kubectl logs command is useful when you want to quickly check why a pod has failed, why it is behaving differently, or whether it is doing what it is supposed to do. However, when you have several nodes with dozens or even hundreds of pods running on them, you need a more efficient way to handle logs. There are a few log-aggregation systems available, including the ELK stack, that can store large amounts of log data in a standardized format. A log aggregation system uses a push mechanism to collect the data. This means that an agent must be installed on the source entities to collect the log data and send it to the central server. For the ELK stack, several agents can do this job, including Filebeat, Logstash, and Fluentd. If you install Kubernetes through a cloud provider like GCP, a Fluentd agent is often already deployed as part of the installation and configured to send logs to Stackdriver. However, you can easily change the configuration to send the logs to a different target.
To support this push model, there are three common approaches; Kubernetes natively supports the first two:
Using a DaemonSet: a DaemonSet ensures that a specific pod is always running on every node in the cluster. This pod runs the agent image (for example, fluentd) and is responsible for sending the logs from the node to the central server. By default, Kubernetes redirects all container logs to a unified location on the node; the DaemonSet pod collects logs from this location.
Using a sidecar: a sidecar is a container that runs in the same pod as the application container. Because of the way pods work, the sidecar container has access to the same volumes and shares the same network namespace as the application container. A sidecar container can send the logs either by pulling them from the application (for example, through an API endpoint designed for that purpose) or by scanning and parsing the log files that the application writes (remember, they share the same storage). A minimal sketch of this pattern follows this list.
Using the application logic: this does not need any Kubernetes support. You can simply design the application so that it sends logs periodically to the central log server. However, this is not a recommended approach because the application would be tightly coupled to its log server. You will have to generate your logs in the specific format that the server accepts. If you decide to switch to another server, you will have to modify the application code. On the other hand, if you hand the log collection, parsing, and pushing to the sidecar container, you only need to change the sidecar image when choosing a different log server. The application container remains intact.
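To make the sidecar pattern more concrete, here is a minimal sketch (not part of the lab; image names and paths are illustrative assumptions) of a pod in which the application writes to a file on a shared emptyDir volume and a sidecar container streams that file to STDOUT, where a node-level agent or kubectl logs can pick it up:

apiVersion: v1
kind: Pod
metadata:
  name: app-with-logging-sidecar
spec:
  volumes:
  - name: app-logs
    emptyDir: {}
  containers:
  - name: app
    image: busybox
    # The "application" writes its log lines to a file on the shared volume
    args: [/bin/sh, -c,
      'while true; do echo "$(date) app event" >> /var/log/app/app.log; sleep 5; done']
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  - name: log-sidecar
    image: busybox
    # The sidecar follows the same file and exposes it on STDOUT
    args: [/bin/sh, -c,
      'touch /var/log/app/app.log; tail -f /var/log/app/app.log']
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app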
What Is The ELK Stack?
The ELK stack is a popular log aggregation and visualization solution that is maintained by Elastic. The word “ELK” is an acronym for the following components:
Elasticsearch: this is where the data gets stored and indexed.
Logstash: the program responsible for transforming logs to a format that is suitable for being stored in the ElasticSearch database.
Kibana: the UI through which you can communicate with the Elasticsearch API, run complex queries, and visualize the results to get more insight into the data. You can also use Kibana to set up alerts that fire when a threshold is crossed. For example, you can get notified when the number of 5xx errors in the Apache logs exceeds a certain limit.
LAB: Collecting And Aggregating Logs In A Cloud-Native Environment Using Kubernetes And The ELK Stack
In this lab, we will demonstrate how we can use a combination of Kubernetes for container orchestration and the ELK stack for log collection and analysis with a sample web application. For this lab, you will need admin access to a running Kubernetes cluster and the kubectl tool installed and configured for that cluster.
Installing Elasticsearch
We start by installing the Elasticsearch component. We are going to create a service account to be used by the component. We don’t want to give it admin access; it only needs read access to services, namespaces, and endpoints. Let’s start by creating the necessary resources to activate this account: the service account, the cluster role, and the cluster role binding:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elasticsearch-logging
  namespace: kube-system
  labels:
    k8s-app: elasticsearch-logging
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: elasticsearch-logging
  labels:
    k8s-app: elasticsearch-logging
rules:
- apiGroups:
  - ""
  resources:
  - "services"
  - "namespaces"
  - "endpoints"
  verbs:
  - "get"
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: kube-system
  name: elasticsearch-logging
  labels:
    k8s-app: elasticsearch-logging
subjects:
- kind: ServiceAccount
  name: elasticsearch-logging
  namespace: kube-system
  apiGroup: ""
roleRef:
  kind: ClusterRole
  name: elasticsearch-logging
  apiGroup: ""
Save this definition to a file and apply it. For example:
$ kubectl apply -f rbac.yml
serviceaccount/elasticsearch-logging created
clusterrole.rbac.authorization.k8s.io/elasticsearch-logging created
clusterrolebinding.rbac.authorization.k8s.io/elasticsearch-logging created
Next, we need to deploy the actual Elasticsearch cluster. We use a StatefulSet for this purpose because Elasticsearch needs stable hostnames, network identities, and storage. Our StatefulSet definition may look as follows:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch-logging
  namespace: kube-system
  labels:
    k8s-app: elasticsearch-logging
spec:
  serviceName: elasticsearch-logging
  replicas: 2
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      k8s-app: elasticsearch-logging
  template:
    metadata:
      labels:
        k8s-app: elasticsearch-logging
    spec:
      serviceAccountName: elasticsearch-logging
      containers:
      - image: elasticsearch:6.8.4
        name: elasticsearch-logging
        ports:
        - containerPort: 9200
          name: db
          protocol: TCP
        - containerPort: 9300
          name: transport
          protocol: TCP
        volumeMounts:
        - name: elasticsearch-logging
          mountPath: /data
        env:
        - name: "NAMESPACE"
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
      volumes:
      - name: elasticsearch-logging
        emptyDir: {}
      initContainers:
      - image: alpine:3.6
        command: ["/sbin/sysctl", "-w", "vm.max_map_count=262144"]
        name: elasticsearch-logging-init
        securityContext:
          privileged: true
I’m going to discuss the important parts of this definition only.
The env section: we inject the name of the namespace that Elasticsearch is running in through the NAMESPACE environment variable, using the downward API to grab the name of the current namespace.
The volumes section: we are using the emptyDir volume type. In a real scenario, you may want to use persistent volumes instead.
The initContainers section: Elasticsearch requires the vm.max_map_count Linux kernel parameter to be at least 262144, so we use an init container that sets this parameter before the application starts. Setting kernel parameters requires root privileges and access to the host kernel, so we set privileged to true in the init container’s securityContext.
The last part we need here is the Service through which we can access the Elasticsearch databases. Add the following to a YAML file and apply it:
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-logging
  namespace: kube-system
  labels:
    k8s-app: elasticsearch-logging
spec:
  ports:
  - port: 9200
    protocol: TCP
    targetPort: db
  selector:
    k8s-app: elasticsearch-logging
Notice that we didn’t specify any means of external access through this Service. The Service type defaults to ClusterIP, which means it is accessible only from within the cluster.
Let’s apply that last definition to create the service. You should now be able to view the default welcome message of Elasticsearch by using port forwarding as follows:
kubectl port-forward -n kube-system svc/elasticsearch-logging 9200:9200
Now, you can use curl or just open your browser and navigate to localhost:9200. For example:
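If Elasticsearch is healthy, the root endpoint returns a small JSON document. The exact values (node name, cluster name and UUID, build details) will differ in your cluster; the output below is trimmed and only meant to show the general shape:

$ curl localhost:9200
{
  "name" : "...",
  "cluster_name" : "...",
  "cluster_uuid" : "...",
  "version" : {
    "number" : "6.8.4",
    ...
  },
  "tagline" : "You Know, for Search"
}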
Installing Logstash
Logstash acts as an adapter that receives the raw logs and formats them in a way that Elasticsearch understands. The tricky part about Logstash lies in its configuration; the rest is just a Deployment that mounts the configuration from a configMap and a Service that exposes Logstash to the other pods in the cluster. So, let’s spend a few minutes on the configMap. Create a new file called logstash-config.yml and add the following lines to it:
apiVersion: v1
kind: ConfigMap
metadata:
  name: logstash-configmap
  namespace: kube-system
data:
  logstash.yml: |
    http.host: "0.0.0.0"
    path.config: /usr/share/logstash/pipeline
  logstash.conf: |
    input {
      beats {
        port => 5044
      }
    }
    filter {
      grok {
        match => { "message" => "%{COMBINEDAPACHELOG}" }
      }
      date {
        match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
      }
      geoip {
        source => "clientip"
      }
    }
    output {
      elasticsearch {
        hosts => ["elasticsearch-logging:9200"]
      }
    }
The configMap contains two files: logstash.yml and logstash.conf. The first file has just two lines: it defines the network address on which Logstash listens (we specified 0.0.0.0 so that it listens on all available interfaces), and it points Logstash at its pipeline configuration directory, /usr/share/logstash/pipeline. That directory is where the second file (logstash.conf) gets mounted, and it is this file that instructs Logstash on how to parse the incoming log data. Let’s have a look at the interesting parts of this file:
- The input stanza tells Logstash where to get its data. The daemon listens on port 5044, and an agent (Filebeat in our case) pushes logs to this port.
- The filter stanza specifies how logs should be interpreted. Logstash uses filters to parse and transform log lines into a format that Elasticsearch understands. In our example, we are using grok. Explaining how the grok filter works is beyond the scope of this article, but you can read more about it in the Logstash documentation. We are using one of the patterns that grok ships with, COMBINEDAPACHELOG, which parses Apache logs in the combined format. Since it’s a well-known log format, grok can automatically extract the key information from each line and convert it to JSON (a short illustrative example follows this list).
- The date stanza parses the timestamp embedded in each log line; the match option tells Logstash which format the timestamp uses so that it can become the event’s timestamp.
- The geoip part uses the client’s IP address (the clientip field extracted by grok) to enrich each event with geographical information about where the request is coming from.
- The output part defines the target to which Logstash forwards the parsed log data. In our lab, we want Logstash to forward it to the Elasticsearch cluster. We specify the short service name without the namespace and the rest of the URL (as in elasticsearch-logging.kube-system.svc.cluster.local) because both resources live in the same namespace.
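To give a feel for what the grok filter does, here is a hypothetical combined-format Apache log line and, roughly, the fields that COMBINEDAPACHELOG extracts from it. The field names follow the standard pattern, but treat the exact output as illustrative; it can vary slightly between Logstash versions:

# Raw Apache combined log line (input)
10.42.0.1 - - [02/Nov/2019:08:46:40 +0000] "GET /notfound HTTP/1.1" 404 196 "-" "curl/7.58.0"

# Approximately what grok extracts from it
{
  "clientip": "10.42.0.1",
  "timestamp": "02/Nov/2019:08:46:40 +0000",
  "verb": "GET",
  "request": "/notfound",
  "httpversion": "1.1",
  "response": "404",
  "bytes": "196",
  "agent": "curl/7.58.0"
}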
Let’s apply this configMap and create the necessary deployment.
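Assuming you saved the configMap above as logstash-config.yml:

$ kubectl apply -f logstash-config.yml
configmap/logstash-configmap created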
Create a new file called logstash-deployment.yml and add the following lines to it:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logstash-deployment
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: logstash
  template:
    metadata:
      labels:
        app: logstash
    spec:
      containers:
      - name: logstash
        image: docker.elastic.co/logstash/logstash:6.8.4
        ports:
        - containerPort: 5044
        volumeMounts:
        - name: config-volume
          mountPath: /usr/share/logstash/config
        - name: logstash-pipeline-volume
          mountPath: /usr/share/logstash/pipeline
      volumes:
      - name: config-volume
        configMap:
          name: logstash-configmap
          items:
          - key: logstash.yml
            path: logstash.yml
      - name: logstash-pipeline-volume
        configMap:
          name: logstash-configmap
          items:
          - key: logstash.conf
            path: logstash.conf
The deployment uses the configMap we created earlier, the official Logstash image, and declares that it should be reached on port 5044. The last resource we need here is the Service that will make this pod reachable. Create a new file called logstash-service.yml and add the following lines to it:
kind: Service
apiVersion: v1
metadata:
  name: logstash-service
  namespace: kube-system
spec:
  selector:
    app: logstash
  ports:
  - protocol: TCP
    port: 5044
    targetPort: 5044
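Apply both files before moving on, and optionally confirm that the Logstash pod is up:

$ kubectl apply -f logstash-deployment.yml
$ kubectl apply -f logstash-service.yml
$ kubectl -n kube-system get pods -l app=logstash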
Installing Filebeat Agent
Filebeat is the agent that we are going to use to ship logs to Logstash. We are using a DaemonSet for this deployment; a DaemonSet ensures that an instance of the pod is running on each node in the cluster. To deploy Filebeat, we need to create a service account, a cluster role, and a cluster role binding the same way we did with Elasticsearch. We also need a configMap to hold the instructions that Filebeat uses to ship logs. I’ve combined all the required resources in one definition file that we’ll discuss:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: filebeat
  labels:
    k8s-app: filebeat
rules:
- apiGroups: [""] # "" indicates the core API group
  resources:
  - namespaces
  - pods
  verbs:
  - get
  - watch
  - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: filebeat
subjects:
- kind: ServiceAccount
  name: filebeat
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: filebeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: kube-system
  labels:
    k8s-app: filebeat
data:
  filebeat.yml: |-
    filebeat.config:
      prospectors:
        # Mounted `filebeat-prospectors` configmap:
        path: ${path.config}/prospectors.d/*.yml
        # Reload prospectors configs as they change:
        reload.enabled: false
      modules:
        path: ${path.config}/modules.d/*.yml
        # Reload module configs as they change:
        reload.enabled: false
    output.logstash:
      hosts: ['logstash-service:5044']
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-prospectors
  namespace: kube-system
  labels:
    k8s-app: filebeat
data:
  kubernetes.yml: |-
    - type: docker
      containers.ids:
      - "*"
      processors:
      - add_kubernetes_metadata:
          in_cluster: true
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat
spec:
  selector:
    matchLabels:
      k8s-app: filebeat
  template:
    metadata:
      labels:
        k8s-app: filebeat
    spec:
      serviceAccountName: filebeat
      terminationGracePeriodSeconds: 30
      containers:
      - name: filebeat
        image: docker.elastic.co/beats/filebeat:6.8.4
        args: [
          "-c", "/etc/filebeat.yml",
          "-e",
        ]
        securityContext:
          runAsUser: 0
        volumeMounts:
        - name: config
          mountPath: /etc/filebeat.yml
          readOnly: true
          subPath: filebeat.yml
        - name: prospectors
          mountPath: /usr/share/filebeat/prospectors.d
          readOnly: true
        - name: data
          mountPath: /usr/share/filebeat/data
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: config
        configMap:
          defaultMode: 0600
          name: filebeat-config
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: prospectors
        configMap:
          defaultMode: 0600
          name: filebeat-prospectors
      - name: data
        emptyDir: {}
Quite a long file but it’s easier than it looks. Again, we’ll discuss the important parts:
The ServiceAccount, ClusterRole, and ClusterRoleBinding at the top grant Filebeat read-only access to the resources of interest (pods and namespaces).
The output.logstash setting in the filebeat-config configMap tells Filebeat to ship the log data to Logstash. We specify the short Service URL since both resources live in the same namespace.
Among the filesystems mounted into the Filebeat container is /var/lib/docker/containers. Notice that the volume type for this path is hostPath, which means that Filebeat accesses this path on the node rather than inside its own container. Kubernetes uses this path on the node to write data about the containers; additionally, any STDOUT or STDERR coming from the containers running on the node is written to this path in JSON format (that output is still viewable through the kubectl logs command, but a copy is kept at this path).
Apply the above definition file to the cluster. Now the last part remaining in the stack is the visualization window, Kibana. Let’s deploy it.
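Before that, a quick sanity check that a Filebeat pod landed on every node never hurts; the DESIRED and READY counts of the DaemonSet should match your node count:

$ kubectl -n kube-system get daemonset filebeat
$ kubectl -n kube-system get pods -l k8s-app=filebeat -o wide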
Installing Kibana
Kibana is just the UI through which you can execute simple and complex queries against the Elasticsearch database. Kibana needs to know the URL through which it can reach Elasticsearch; we’ll supply it through an environment variable. No further configuration is needed for the purposes of this lab, so we are not using a configMap. The definition file for Kibana may look as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana-logging
  namespace: kube-system
  labels:
    k8s-app: kibana-logging
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: kibana-logging
  template:
    metadata:
      labels:
        k8s-app: kibana-logging
    spec:
      containers:
      - name: kibana-logging
        image: docker.elastic.co/kibana/kibana-oss:6.8.4
        env:
        - name: ELASTICSEARCH_URL
          value: http://elasticsearch-logging:9200
        ports:
        - containerPort: 5601
          name: ui
          protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  name: kibana-logging
  namespace: kube-system
  labels:
    k8s-app: kibana-logging
    kubernetes.io/name: "Kibana"
spec:
  type: NodePort
  ports:
  - port: 5601
    protocol: TCP
    targetPort: ui
    nodePort: 32010
  selector:
    k8s-app: kibana-logging
Let’s have a look at the interesting parts of this definition:
The ELASTICSEARCH_URL environment variable tells Kibana where to reach Elasticsearch. As usual, we use the short form of the Service URL.
The Service we are creating needs external exposure so that we can log in and view the logs. So, we use the NodePort Service type and specify 32010 as the port number. This port is opened on every node in the cluster. Note that, depending on your underlying infrastructure or cloud provider, you may need to allow this port through the firewall.
Apply the above definition to the cluster, wait a few moments for the pod to get deployed, and navigate to http://<node_ip>:32010. You should see Kibana’s dashboard; click “Skip” to avoid Kibana adding sample data. Now, click on Discover in the left panel. Since no index pattern exists yet, Kibana prompts you to create one:
Type logstash* as the index pattern. This instructs Kibana to query Elasticsearch’s indices that match this pattern. Click “Next step”.
Select @timestamp as the Time Filter field name and click “Create index pattern”.
The index will get created in a few seconds. Now click on “Discover”. You should see something like the following:
That’s a lot of data! The reason is that Filebeat is shipping all the log data that the node is generating about the containers running inside it. Let’s make things more interesting by deploying a sample web server and demonstrating how we can grab its logs collectively from multiple pods.
Deploying A Sample Application: Apache Webserver
Applications should be designed so that they log their output and error messages to STDOUT and STDERR. As mentioned earlier, Docker (and Kubernetes in clustered environments) automatically keeps a copy of those logs on the node, so that agents like Filebeat can ship them along with the node logs. The Apache image (httpd) follows this logging pattern, so we’ll deploy it as our sample application. A definition like the following (a minimal sketch; the replica count and image tag are illustrative, while the Service name, namespace, and app=web label match the commands and filters used later in this lab) contains the Deployment and Service resources necessary to bring the webserver up on multiple pods:
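apiVersion: apps/v1
kind: Deployment
metadata:
  name: webserver
  namespace: default
  labels:
    app: web
spec:
  # Three replicas so that requests are spread across several pods
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: httpd
        image: httpd:2.4
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: webserver
  namespace: default
spec:
  selector:
    app: web
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80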
Apply this definition to your cluster. Now, let’s test and see if the webserver is running, and make a few requests to generate some log data. First, we need to use port-forwarding as this webserver is not publicly exposed:
kubectl -n default port-forward svc/webserver 8080:80
If you open the browser and navigate to localhost:8080, you should find the famous “It works!” message. Refresh the page a few times to increase the probability of having different pods responding to your requests.
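If you prefer the command line, a small loop (assuming the port-forward from the previous step is still running) generates enough traffic to spread across the pods:

$ for i in $(seq 1 30); do curl -s localhost:8080 > /dev/null; done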
Testing The Workflow
By now you should have five components running on your cluster: Apache, Filebeat, Logstash, Elasticsearch, and Kibana. In the previous step, you made a few requests to the web server, let’s see how we can track this in Kibana.
Open the Discover page and make sure the selected time range covers at least the past hour.
If you click on “Add a filter” on the left, you can see the many fields that you can use to select the log messages we are interested in.
Since our web server pods carry the label app=web, we can select that label in our filter.
Of course, your output may differ, but it should look broadly similar to the following:
Note that the graph displays how many log messages match our filter (that is, messages coming from pods with the label app=web) and when they occurred.
The message field displays the exact log line that Apache output. On its own this is not very useful, since we could always get the same output using the kubectl logs command. The real power of the ELK stack comes from its ability to aggregate logs from different sources. For example, we can count all the 404 errors that occurred in the last hour across all pods that serve our application, or on one specific pod. Let’s test that.
In your browser, generate several requests to http://localhost:8080/notfound. We do not have a file in our web directory called notfound, so Apache will respond with 404 errors indicating that the requested resource was not found on the server.
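Again, a quick loop works just as well as the browser (assuming the port-forward is still active):

$ for i in $(seq 1 20); do curl -s -o /dev/null localhost:8080/notfound; done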
On Kibana, and while you still have the previous filter set, add a second filter that matches 404 responses (for example, on the response field extracted by grok):
By clicking Save, you are applying this filter on the data that you have. You should see something similar to the following:
The graph displays the number of 404 messages and their time of occurrence. The log line contains the message itself, specifying which file was requested and not found. You also have additional fields that you can use to narrow down the selection even further, such as the node name, the container name, and the pod name.
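The same aggregation can also be reproduced outside Kibana by querying Elasticsearch directly. A hedged sketch: it assumes the Elasticsearch port-forward from earlier is still running, and the field names (kubernetes.labels.app from the add_kubernetes_metadata processor, response from the COMBINEDAPACHELOG grok pattern) may differ slightly between Filebeat and Logstash versions:

# Count documents from pods labeled app=web whose Apache response code is 404
$ curl -s 'localhost:9200/logstash-*/_count?q=kubernetes.labels.app:web+AND+response:404&pretty'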
About ELK Stack Components Compatibility
Notice that we used the same major and minor version numbers when deploying the ELK stack components, so that all of them are versioned 6.8.4. This is intentional: the ELK stack components work with each other as long as you follow the compatibility matrix. You are strongly encouraged to review the Elastic Support Matrix document before attempting to deploy the ELK stack in your environment.
Production Environment Security Considerations
We wanted this lab to be as simple as possible, so we ignored additional levels of configuration that would have distracted the reader from the core concepts we wanted to deliver. However, in production environments, you should consider the following topics that we did not cover:
- Elasticsearch: as deployed in this lab, the cluster has no authentication mechanism enabled. You may want to add a reverse proxy that implements basic authentication to protect the cluster (even if it is not publicly exposed).
- Kibana has its own methods of authentication. So, you can use that or you can add another reverse proxy server in front of it with basic authentication.
- Once authentication is enabled, different services may need credentials to contact each other. Those credentials should be stored in Secrets.
- In our lab, we used the NodePort service type to expose our Kibana service publicly. Using NodePort has its own shortcomings because node failure detection needs to be implemented on the client-side. A better solution is to use a Load Balancer or an Ingress controller.
- You should implement SSL/TLS on any publicly accessible endpoint. If you are using a cloud provider’s Load Balancer, this may already be handled for you. You can also use your own certificates if you prefer. A minimal sketch of a TLS-terminating Ingress for Kibana follows this list.
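For illustration only, here is a minimal sketch of a TLS-terminating Ingress for the Kibana Service. It assumes an ingress controller (such as ingress-nginx) is installed, that a TLS secret named kibana-tls already exists in kube-system, and that kibana.example.com points at your cluster; none of these is part of this lab:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kibana-ingress
  namespace: kube-system
spec:
  tls:
  - hosts:
    - kibana.example.com
    secretName: kibana-tls   # assumed pre-created TLS secret
  rules:
  - host: kibana.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: kibana-logging
            port:
              number: 5601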
TL;DR
- Logging has always been a top priority that should be addressed in the earliest design stages. With good logging you can not only diagnose bugs and gain insight into how the system is behaving, but also spot potential issues before they occur.
- In non-cloud-native environments, logging was not much of an issue because each component had a well-defined location. For example, the webserver is hosted on machine A, the application is on machine B and the database is on machine C. You could easily collect and identify which logs are coming from which source.
- In microservices-dominated environments, logging and log-collection should be done differently. You’re no longer aware of which specific node/pod responded to which web request. Maybe some requests are failing on a specific pod but are responded to normally on another. For that reason, we use log aggregation systems like ELK stack.
- ELK is an open-source stack maintained by Elastic. It consists of three components: the Elasticsearch database, the Logstash adapter, and the Kibana UI.
- We can deploy the ELK stack on Kubernetes using a StatefulSet for Elasticsearch, Deployments for Logstash and Kibana, configMaps for the necessary configuration, and the required service account, cluster role, and cluster role binding.
- ELK stack works by receiving log data from different sources. For the source to send its logs, it needs an agent. There are many agents that can do this role like Logstash, Fluentd, and Filebeat.
- Once the data is stored in Elasticsearch, you can use Kibana to run queries against the database. Through visualizations, you can gain observability into different aspects of the running application. For example, in the lab, we were able to determine the frequency of 404 error responses regardless of the node, pod, or container they originated from.