Debugging and Troubleshooting Microservices in Kubernetes with Ray Tsang (Google)
Google Developer Advocate Ray Tsang shows us how to debug microservices running in Kubernetes.
Last fall Ray Tsang, Developer Advocate at Google for the Google Cloud Platform, showed us how to debug microservices running in Kubernetes. Carlos León from Container Solutions then followed up with a talk on monitoring your microservices in Kubernetes.
To illustrate debugging and troubleshooting, Ray deployed a guestbook application: you enter a name, and the app lets you send that person a greeting or a text message.
The app has a UI for entering the name and two backend services: one that manages and sends the message, and another that keeps track of the guestbook and its users. Both services are backed by databases. The guestbook keeps its names in a persistent MySQL database, while the text greeting and its session are stored in Redis, an in-memory database.
The app was running on a five-node Kubernetes cluster in Google Container Engine.
When Ray opened the app’s UI in a browser, it returned a 500 Internal Server Error. Ray then showed us how to go about debugging this kind of error in Kubernetes:
- Reproduce the error in the same browser session, then try it in a different browser.
- Check that the app is working in your staging environment.
- Check that the images are the same in both environments by running `kubectl describe pods`, and then make sure that the YAMLs deployed to staging are the same as the ones deployed to production.
- Dive into the logs and aggregate them if necessary.
- Isolate the instance to identify the root cause.
About the :latest Tag
On examining the pods with `kubectl describe pods` and looking at the deployed images, Ray noted that one image was deployed using the :latest tag. As a general rule, it is not good practice to deploy something using the :latest tag:
“You don’t know exactly when :latest is. It could be the latest from today, yesterday or last week -- you just have no idea,” said Ray.
So if you see an image deployed using the :latest tag, you will need to check the Image ID and compare the SHA256 digests to make sure that the image that is running is the one you think you deployed. You can copy that SHA256 digest and search for it in your image registry to find out where the image came from.
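The digest comparison described above can be scripted. The following is a minimal sketch, assuming you have saved the output of `kubectl get pods -o json` from each environment; the function names and the sample image names and digests are made up for illustration and are not from Ray's demo.

```python
import json


def images_by_container(pod_list_json):
    """Map container name -> set of (image, imageID) pairs.

    Expects the JSON text printed by `kubectl get pods -o json`.
    """
    found = {}
    for pod in json.loads(pod_list_json).get("items", []):
        for status in pod.get("status", {}).get("containerStatuses", []):
            pair = (status["image"], status["imageID"])
            found.setdefault(status["name"], set()).add(pair)
    return found


def digest_of(image_id):
    """Return the 'sha256:...' part of an imageID, or '' if it has none."""
    _, marker, digest = image_id.rpartition("sha256:")
    return marker + digest if marker else ""


def compare_environments(staging_json, production_json):
    """Return findings: images running :latest and digest mismatches."""
    staging = images_by_container(staging_json)
    production = images_by_container(production_json)
    findings = []
    for name in sorted(set(staging) | set(production)):
        for image, _ in sorted(production.get(name, ())):
            if image.endswith(":latest"):
                findings.append(f"{name}: production runs {image}")
        staging_digests = {digest_of(i) for _, i in staging.get(name, ())}
        production_digests = {digest_of(i) for _, i in production.get(name, ())}
        if staging_digests != production_digests:
            findings.append(f"{name}: digest differs between environments")
    return findings


def sample_pod_list(name, image, digest):
    """Build a tiny pod list with made-up values, for demonstration only."""
    repository = image.split(":")[0]
    status = {
        "name": name,
        "image": image,
        "imageID": f"docker-pullable://{repository}@sha256:{digest}",
    }
    return json.dumps({"items": [{"status": {"containerStatuses": [status]}}]})


if __name__ == "__main__":
    staging = sample_pod_list("ui", "example/guestbook-ui:latest", "aaa111")
    production = sample_pod_list("ui", "example/guestbook-ui:latest", "bbb222")
    for finding in compare_environments(staging, production):
        print(finding)
```

Both environments here claim to run the same `:latest` image, yet the digests differ, which is exactly the situation the tag hides.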
Examine the Logs
If you’ve determined that the digests of your images are the same and you are still having a problem with your app, the next step is to look at the logs.
Kubernetes makes it really easy to get logs: `kubectl logs` followed by a pod name prints the logs of that pod.
In this case we only have two instances on which to check the logs, but what if you have 10 or more that you need to check?
If that’s the case, then you will need to integrate an external log aggregator that collects all of your logs so that you can pinpoint which one of your instances has a problem. Ray recommends something like the ELK stack (Elasticsearch, Logstash and Kibana), which is open source and works well with Kubernetes.
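To make the idea of pinpointing the bad instance concrete, here is a small sketch that ranks pods by the number of suspicious log lines. It is a stand-in for what an aggregator does at scale, not a replacement for one. It assumes you have already collected each pod's log text (for example with `kubectl logs`); the pod names, log lines and marker strings are invented for illustration.

```python
from collections import Counter


def errors_per_pod(logs_by_pod, markers=("ERROR", "Exception")):
    """Count log lines containing any marker, per pod, worst pod first.

    logs_by_pod maps a pod name to the full text of its logs.
    Pods without any matching line are left out of the result.
    """
    counts = Counter()
    for pod, text in logs_by_pod.items():
        for line in text.splitlines():
            if any(marker in line for marker in markers):
                counts[pod] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Made-up log excerpts for two instances of the same service.
    logs = {
        "helloworld-1": "INFO started\nINFO request served",
        "helloworld-2": (
            "INFO started\n"
            "ERROR redis timeout\n"
            "java.lang.IllegalStateException: boom"
        ),
    }
    for pod, count in errors_per_pod(logs):
        print(pod, count)
```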
In Google Cloud there is direct integration with Stackdriver Logging. With it, you can drill right into your cluster and pick the namespace that you want to see. The integration also lets you export the data to BigQuery so that you can run all kinds of interesting queries on it.
Once you’ve pinpointed the instance on which the problem is occurring, you can investigate the root cause further by using a very useful Kubernetes feature called labels, which allow you to dynamically isolate a pod (among other things). For example, assuming the service selects its pods with a `serving=true` label, you can stop a pod from receiving traffic through the load balancer by relabeling it with `kubectl label`, passing `serving=false` and the `--overwrite` flag.
With the instance isolated, you can get into it to do some deeper troubleshooting. For example, you could forward the pod's port to your localhost with `kubectl port-forward` to see whether the app works there.
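Why relabeling isolates a pod is easier to see with a toy model of label selection. The sketch below is not Kubernetes code: it only mimics the equality-based matching that a service selector applies to pod labels, and the pod names and the `app` and `serving` labels are assumptions for illustration.

```python
def matches(selector, labels):
    """True if every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints(selector, pods):
    """Names of the pods that a service with this selector would route to."""
    return sorted(name for name, labels in pods.items() if matches(selector, labels))


if __name__ == "__main__":
    selector = {"app": "guestbook", "serving": "true"}
    pods = {
        "guestbook-1": {"app": "guestbook", "serving": "true"},
        "guestbook-2": {"app": "guestbook", "serving": "true"},
    }
    print(endpoints(selector, pods))  # both pods receive traffic

    # Equivalent in spirit to overwriting the label on the faulty pod.
    pods["guestbook-2"]["serving"] = "false"
    print(endpoints(selector, pods))  # only guestbook-1 is left
```

The isolated pod keeps running, so you can inspect it without it serving user requests.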
Real-time Visualization of your App
When you deploy your app to Kubernetes in Google Container Engine, the annotations you’ve added to your YAML files are used to draw a map of your services that shows you what’s talking to what. But these charts are just documentation: if you haven’t updated your annotations, they will be out of date.
To see your services in real time, Ray suggests using Weave Cloud, which produces a live and interactive map of your services. Since Stackdriver Trace is integrated into Google Cloud, you can use it to test the connections between services and then view them in Weave Cloud.
Monitoring Your Microservices
Next up was Carlos León of Container Solutions, who talked about how to monitor microservices in Kubernetes. In particular, Carlos demonstrated Weave Cloud’s monitoring feature, which, right out of the box, scrapes Kubernetes metrics every 20 seconds, stores them in the cloud and makes them available for you to drill into with Weave Cloud’s unique visualization UI.
Instrumenting Your Code
While there are many exporters available for standard packages like Kubernetes, when it comes to your own application, you will have to instrument it yourself. Prometheus provides a number of popular client libraries that you can use for this.
Instrumenting is the simple part. What requires more thought is what to instrument.
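To show what "instrumenting" means in practice, here is a standard-library-only sketch of a decorator that counts calls and errors and records durations. It is a simplified stand-in for a metrics client library, not the Prometheus API; the `Metrics` class, the `instrument` decorator and the `greet` handler are all hypothetical names.

```python
import time
from collections import defaultdict
from functools import wraps


class Metrics:
    """In-memory store for per-handler counters and durations."""

    def __init__(self):
        self.requests = defaultdict(int)    # handler name -> calls
        self.errors = defaultdict(int)      # handler name -> raised exceptions
        self.durations = defaultdict(list)  # handler name -> seconds per call

    def instrument(self, name):
        """Decorator that records one request per call to the wrapped function."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                self.requests[name] += 1
                try:
                    return func(*args, **kwargs)
                except Exception:
                    self.errors[name] += 1
                    raise
                finally:
                    # Runs on success and on failure alike.
                    self.durations[name].append(time.perf_counter() - start)
            return wrapper
        return decorator


metrics = Metrics()


@metrics.instrument("greet")
def greet(name):
    """Hypothetical request handler used only to exercise the decorator."""
    if not name:
        raise ValueError("name must not be empty")
    return "Hello " + name
```

With a real client library the shape is the same: wrap the handler once, and every request updates a counter and a duration measurement without touching the business logic.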
RED Method: Application Level Metrics
The RED method is a strategy devised by the developers at Weave. It focuses on the metrics that your end users care about, such as the speed and functionality of your site.
- (R)ate: number of requests per second, a.k.a. throughput
- (E)rrors: number of failed requests per second, i.e. requests that returned a 5xx error
- (D)uration: the amount of time that it takes for each request to be served
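The three definitions above can be computed from a plain list of request records. This is a minimal sketch under stated assumptions: the `Request` record and function names are hypothetical, and summarizing duration with nearest-rank 50th and 99th percentiles is a choice made here, not something prescribed by the talk.

```python
import math
from collections import namedtuple

# status: HTTP status code returned; duration: seconds taken to serve the request
Request = namedtuple("Request", ["status", "duration"])


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list (q between 0 and 100)."""
    if not sorted_values:
        return None
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_metrics(requests, window_seconds):
    """Rate, Errors and Duration for requests observed during one time window."""
    durations = sorted(r.duration for r in requests)
    failed = sum(1 for r in requests if 500 <= r.status <= 599)
    return {
        "rate": len(requests) / window_seconds,   # requests per second
        "errors": failed / window_seconds,        # 5xx responses per second
        "duration_p50": percentile(durations, 50),
        "duration_p99": percentile(durations, 99),
    }


if __name__ == "__main__":
    observed = [
        Request(200, 0.1),
        Request(200, 0.2),
        Request(500, 0.3),
        Request(503, 0.4),
    ]
    print(red_metrics(observed, window_seconds=2))
```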
The RED Method encourages you to keep a standard overview of your systems, one that enables you not only to analyze how the application responds to user interaction but also to debug future issues. This approach also helps you analyze how your application's behaviour changes over time. But even though these metrics are a good starting point, they are just that: a starting point. They can bring some insight into your stack, but they won't tell you much about your infrastructure. (Follow our step-by-step tutorial and try it for yourself.)
A good complement to the RED metrics is to include some metrics about your cluster:
- Total amount of CPU available & being used
- Total amount of RAM available & being used
- Network traffic/throughput
- Storage (disk I/O, disk usage, etc.)
When you collect these metrics, you can correlate, for example, the amount of RAM that your application is consuming with the number of visitors that you have. This also helps you detect memory leaks, determine if your disks are filling up or assess the utilization of your infrastructure. (Read more about the importance of user-centric alerting on our blog.)
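As a small illustration of correlating two series, the sketch below computes the Pearson correlation coefficient between visitor counts and RAM usage sampled at the same times. The sample numbers are invented; the point is only that RAM which tracks traffic gives a coefficient near 1, while RAM that climbs regardless of traffic (a possible leak) correlates much more weakly.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    visitors = [100, 200, 300, 400]
    ram_mb = [510, 620, 730, 840]  # grows in step with traffic
    print(pearson(visitors, ram_mb))

    bursty_visitors = [100, 300, 100, 300]
    climbing_ram_mb = [500, 600, 700, 800]  # keeps growing whatever the traffic
    print(pearson(bursty_visitors, climbing_ram_mb))
```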
For SQL and NoSQL databases you may want to measure:
- Current number of connections
- Query duration
- Number of queries per second
These types of metrics help you detect slow-running queries and enable you to correlate slow queries with slow-rendering pages on your site.
Ray Tsang showed us a process for debugging and troubleshooting microservices running in Kubernetes and gave us an overview of the tools available for root cause analysis.
Carlos León from Container Solutions provided an overview of monitoring applications in Kubernetes with Prometheus and of how Weave Cloud simplifies monitoring a large-scale app running in a cluster.
For more information on Prometheus, see Monitoring Kubernetes.
Join the Weave Online User Group for more talks or view the recordings on our YouTube channel.