Whodunnit? Debugging and Diagnosing Microservices with Weave Cloud Beta
Debugging is about understanding the system. If software is behaving in some unexpected way, it means there is some piece of the system you don’t understand. Microservices, in particular, spread the system’s behaviour across many components and this can make it more difficult to predict how the system should act.
There are several ways of categorizing failures in a program. One distinction is between fatal and non-fatal errors: a fatal error, such as a failed assertion, can cause the entire program to abort, whereas a non-fatal error, such as one that is only logged, allows the program to continue. Failures can also be categorized as explicit or implicit. Explicit failures are specific conditions built into the program by the developer; assertion failures are a good example. Implicit failures, on the other hand, arise naturally and are not predetermined by the developer. The classic example is a buffer overflow, where the developer did not explicitly tell the program to fail under those conditions, but it fails anyway.
With a microservices-based approach, almost all of our failures are implicit and non-fatal. Services are sometimes unavailable, leaking resources, or behaving pathologically. Even worse, they frequently cause cascading failures, where the symptoms may only be visible two or three services removed from the underlying fault.
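To make these categories concrete, here is a minimal Python sketch. The function names and the quantity/stock scenario are hypothetical and not taken from Sock Shop. Ordinary Python code does not overflow buffers, so an out-of-range list index stands in for the implicit case.

```python
import logging

logger = logging.getLogger("failures")


def explicit_fatal(quantity):
    # Explicit and fatal: the developer wrote this check, and a failed
    # assertion aborts the call (and the program, if nothing catches it).
    assert quantity > 0, "quantity must be positive"
    return quantity


def explicit_non_fatal(quantity):
    # Explicit and non-fatal: the developer anticipated the condition,
    # logs an error, and lets the program continue with a fallback value.
    if quantity <= 0:
        logger.error("invalid quantity %d, falling back to 1", quantity)
        return 1
    return quantity


def implicit_failure(stock, index):
    # Implicit: nobody wrote a check here. If index is out of range,
    # the runtime raises IndexError on its own.
    return stock[index]
```

Calling `explicit_fatal(0)` raises `AssertionError`, `explicit_non_fatal(0)` logs an error and returns `1`, and `implicit_failure([1, 2, 3], 5)` raises `IndexError` even though no line of the function mentions that case.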
Introducing Sock Shop
Our demo application “Sock Shop” is built with microservices and cloud-native technologies, and is available on GitHub, so you can follow along.
I’ve followed the instructions in the Sock Shop repo to set it up on a Kubernetes cluster, but something is wrong. The page above should be showing our catalogue full of socks, but as we can see, it is coming up blank. Let’s see how we can use Weave Cloud to debug and understand this failure.
Troubleshooting in Weave Cloud
Weave Cloud draws an architecture diagram of our system, based on the actual traffic observed.
One of the most common techniques, and a great place to start when debugging distributed systems, is to “walk the request path”. Armed with an understanding of how our system should be working, we can trace through the requests which are actually happening, and see where they diverge from our expectations.
Other tools, such as Zipkin, can also be very useful for this type of debugging. Unlike them, however, Weave Cloud does not require us to instrument our application at all. This means that Weave Cloud can show connections happening through third-party software, like nginx or HAProxy.
In this example, our request starts at the front-end service. The front-end routes our request to the catalogue service, but then the request chain stops. From building the system, we know that the catalogue service should make requests against the catalogue-db, so we can assume that there is some sort of error happening here.
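The idea of walking the request path can be sketched as a comparison between the edges we expect and the edges actually observed. This is only an illustration of the reasoning, not how Weave Cloud works internally; `EXPECTED_PATH` and `first_divergence` are hypothetical names, and the observed edges are typed in by hand from the diagram described above.

```python
# Expected request path, as (caller, callee) pairs, in request order.
EXPECTED_PATH = [
    ("front-end", "catalogue"),
    ("catalogue", "catalogue-db"),
]


def first_divergence(expected_path, observed_edges):
    """Return the first expected edge with no observed traffic, or None."""
    observed = set(observed_edges)
    for edge in expected_path:
        if edge not in observed:
            return edge
    return None


# Traffic seen in this example: the chain stops at the catalogue service.
observed = [("front-end", "catalogue")]
print(first_divergence(EXPECTED_PATH, observed))
# prints ('catalogue', 'catalogue-db')
```

The first missing edge is where the investigation should begin, which in this case points at the link between the catalogue service and its database.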
Weave Cloud lets us run a “docker attach” against any container quickly and easily, right from the browser. We can confirm that the error is in the “catalogue” service by attaching to it, and seeing if there are errors being logged.
Let’s click the “attach” button on our catalogue container (it’s the monitor one):
Ah ha! We can see that for each request, there are some errors being logged. In particular this line looks suspicious:
ts=2016-10-17T13:26:37Z caller=middlewares.go:78 method=Tags result=0 err="database connection error" took=4.056589509s
We’ve seen a “database connection error”, and we’re hot on the trail!
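Log lines in this key=value style are easy to pick apart mechanically, which helps when there are too many to read by eye. The following is a rough sketch that uses the standard library's `shlex` to handle the quoted `err` value. `parse_logfmt` is a hypothetical helper, and the approach assumes values contain no stray quotes or backslashes, which `shlex` would interpret or reject.

```python
import shlex


def parse_logfmt(line):
    """Split a key=value log line into a dict, honouring double quotes."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


LINE = (
    'ts=2016-10-17T13:26:37Z caller=middlewares.go:78 method=Tags '
    'result=0 err="database connection error" took=4.056589509s'
)

fields = parse_logfmt(LINE)
print(fields["method"], "->", fields["err"])
# prints Tags -> database connection error
```

With the fields separated, it becomes straightforward to filter for lines where `err` is non-empty or to group errors by `method`.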
Software debugging is one of the purest forms of experimentation. Everything is deterministic. This makes it an excellent place to apply the scientific method.
Let’s review the symptoms we are seeing: no data coming from the catalogue, and errors appearing in the catalogue logs. We can hypothesize that there may be an error connecting to the catalogue database. Let’s do an experiment to test this.
Weave Cloud provides a quick and easy way to try things out on your system, with the “exec” functionality. We’ll open a “docker exec” into one of our catalogue containers by clicking the “exec” button. This container can be running anywhere that we have deployed a Weave Cloud probe, and we will be seamlessly connected to it right from the browser.
From here, let’s try to ping our catalogue db.
/ # ping catalogue-db
PING catalogue-db (10.0.0.200): 56 data bytes
^C
--- catalogue-db ping statistics ---
17 packets transmitted, 0 packets received, 100% packet loss
This confirms our hypothesis: the catalogue container is unable to communicate with the database.
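Ping only exercises ICMP, so a complementary experiment is to attempt a TCP connection to the port the database should be listening on. The sketch below assumes Python is available wherever it is run, which the article does not show for the catalogue container, and assumes the database listens on port 3306 (the MySQL default). `can_connect` is a hypothetical helper name.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example, assuming the database port is 3306 (not confirmed by the article):
#   can_connect("catalogue-db", 3306)
```

A `False` result here would be consistent with the ping experiment, while a `True` result would suggest the network path is fine and the fault lies elsewhere.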
In this case, the fix is simply deploying the catalogue-db service. Once we do that, we’ll be up and selling socks again! Deploying the fix is as simple as running:
kubectl apply -f catalogue-db-deployment.yaml
Hey, presto! We’ve got our catalogue back, and we can verify that the architecture looks like we expect with Weave Cloud:
The catalogue-db service is running, and traffic is reaching it, just as expected.
If you’re interested in getting a detailed view of your microservices, sign up for Weave Cloud.