No matter how well-engineered your software is, or how ready it is to scale both vertically and horizontally, once your application is out in the wild you need the tools and the means to keep a close eye on it. They let you understand how it behaves under real load, find bottlenecks and query problems, and determine the points of failure when your system goes down.

The following topics are discussed in this tutorial:

What to Measure?

Your system has so many moving parts that it is sometimes difficult to decide where to start or what exactly to measure. The decision becomes easier once you understand the RED method but, as you will learn, the RED method on its own doesn’t always give you the complete picture.

RED Method: Application Level Metrics

To summarize the RED method, for each of your services, you would gather:

  • Rate: number of requests per second a.k.a. throughput
  • Errors: number of errors per second a.k.a. number of requests per second that returned a 5xx error
  • Duration: the amount of time that it takes for each request to be served

The RED Method encourages you to keep a standard overview of your systems that enables you not only to analyze how the application responds to user interaction, but also to debug issues as they arise. This approach also helps you analyze how your application’s behaviour changes over time. But even though these metrics are a good starting point, they are just that: a starting point. They can bring some insight into your stack, but they won’t tell you much about your infrastructure.
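
To make the three RED signals concrete, here is a minimal sketch that derives them from a list of request records for one service. The `Request` record shape, the `red_summary` helper and the sample values are all assumptions made for illustration; they are not part of Weave Cloud or Prometheus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One served request (hypothetical record shape for this sketch)."""
    timestamp: float  # seconds since the start of the window
    status: int       # HTTP status code
    duration: float   # seconds spent serving the request


def red_summary(requests, window_seconds):
    """Return (rate, error_rate, mean_duration) for one service over a window."""
    total = len(requests)
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    rate = total / window_seconds          # Rate: requests per second
    error_rate = errors / window_seconds   # Errors: 5xx responses per second
    # Duration: summarised here as a mean; percentiles are a common alternative.
    mean_duration = sum(r.duration for r in requests) / total if total else 0.0
    return rate, error_rate, mean_duration


# Made-up sample covering a 4 second window.
sample = [
    Request(0.5, 200, 0.120),
    Request(1.2, 200, 0.080),
    Request(2.9, 500, 0.400),
    Request(3.1, 200, 0.100),
]
rate, error_rate, mean_duration = red_summary(sample, window_seconds=4)
print(f"rate={rate:.2f}/s errors={error_rate:.2f}/s duration={mean_duration:.3f}s")
```

In practice a metrics library would compute these from counters and histograms rather than from raw records, but the three quantities are the same.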

Cluster Metrics

A good complement to the RED metrics is to include some metrics about your cluster:

  • Total amount of CPU available & being used
  • Total amount of RAM available & being used
  • Network traffic/throughput
  • Storage (disk I/O, disk usage, etc.)

When you collect these metrics, you can correlate, for example, the amount of RAM that your application is consuming with the number of visitors that you have. This also helps you detect memory leaks, determine if your disks are filling up or assess the utilization of your infrastructure.
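
As a rough sketch of what such a correlation could look like, the snippet below computes a Pearson coefficient between visitor counts and RAM readings taken at the same instants. The `pearson` helper and all of the numbers are made up for illustration. Memory that keeps climbing while traffic does not is one possible hint of a leak, not proof of one.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


# Made-up readings sampled at the same five instants.
visitors = [100, 300, 150, 250, 120]
ram_mb_tracking_traffic = [550, 850, 625, 775, 580]  # rises and falls with visitors
ram_mb_steady_growth = [550, 640, 730, 820, 910]     # grows regardless of visitors

tracking = pearson(visitors, ram_mb_tracking_traffic)
growth = pearson(visitors, ram_mb_steady_growth)
print(f"tracking traffic: {tracking:.2f}, steady growth: {growth:.2f}")
```

A coefficient close to 1 suggests memory use follows load; a coefficient near 0 alongside steadily rising memory is worth a closer look.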

Database Metrics

For SQL and NoSQL databases you can measure:

  • Current number of connections
  • Query duration
  • Number of queries per second

These metrics help you detect slow-running queries and correlate slow queries with slow-rendering pages on your site.
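
One way to gather query durations is to wrap each database call in a timer. The sketch below does this with the standard library's `sqlite3` module and an in-memory database; the `timed_query` and `slow_queries` helpers, the `socks` table and the module-level list are assumptions for illustration only.

```python
import sqlite3
import time
from contextlib import contextmanager

query_durations = []  # (label, seconds) pairs; a real setup would export these


@contextmanager
def timed_query(label):
    """Record how long the wrapped database call takes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        query_durations.append((label, time.perf_counter() - start))


def slow_queries(threshold_seconds):
    """Labels of recorded queries that took at least threshold_seconds."""
    return [label for label, seconds in query_durations if seconds >= threshold_seconds]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE socks (id INTEGER PRIMARY KEY, colour TEXT)")
conn.executemany("INSERT INTO socks (colour) VALUES (?)", [("red",), ("blue",)])

with timed_query("count_socks"):
    count = conn.execute("SELECT COUNT(*) FROM socks").fetchone()[0]

print(count, query_durations)
```

Labelling each query makes it possible to line up a slow label with the page or endpoint that issued it.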

Other Useful Metrics

Other metrics to consider:

  • 3rd party APIs’ rate, errors and duration. This can help you determine whether the bottleneck in an endpoint at a certain point in time is on your side or on the external API’s side
  • Job Queues. If you have a pool of background workers (for processing video, audio, images, bulk sending emails and tasks of that sort), measure the number of jobs that you currently have in the queue. Bonus points if you can measure the number of jobs processed since the last reading
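
The job queue measurements can be sketched as follows: the queue depth is read directly, and "jobs processed since the last reading" is the difference between a counter that only increases and its value at the previous reading. `MeasuredQueue` is a hypothetical in-memory stand-in, not a real queueing library.

```python
from collections import deque


class MeasuredQueue:
    """In-memory job queue that tracks its depth and a processed-jobs counter."""

    def __init__(self):
        self._jobs = deque()
        self._processed_total = 0         # only ever increases
        self._processed_at_last_read = 0  # counter value at the previous reading

    def enqueue(self, job):
        self._jobs.append(job)

    def process_next(self):
        job = self._jobs.popleft()
        self._processed_total += 1
        return job

    def read_metrics(self):
        """Return (queue_depth, jobs_processed_since_last_reading)."""
        delta = self._processed_total - self._processed_at_last_read
        self._processed_at_last_read = self._processed_total
        return len(self._jobs), delta


queue = MeasuredQueue()
for name in ["a.mp4", "b.png", "c.wav"]:
    queue.enqueue(name)

queue.process_next()
first_reading = queue.read_metrics()

queue.process_next()
queue.process_next()
second_reading = queue.read_metrics()

print(first_reading, second_reading)
```

A growing depth combined with a small delta between readings suggests the workers are not keeping up.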

How to collect metrics from your infrastructure and applications

There are several ways to get the metrics out of your system and many vaults where you can store them:

  • The Push Method
    • Metrics can be pushed directly from your application to your data store
    • You can have an agent sitting right next to your application pulling the metrics out of it and then pushing them into the data store
  • The Pull Method
    • You can have an agent that remotely pulls the data out of your application and pushes the metrics into the data store

Special note on short-lived services: Some services are ephemeral and for these types of applications, the pull method might not be apt. You will then need to push the metrics from the service itself.
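
The following sketch illustrates why pulling can miss an ephemeral service. `MetricStore`, `ShortLivedJob` and `scrape` are hypothetical stand-ins for a data store, a short-lived service and a pull-based agent; they only model the timing problem and do not talk to any real system.

```python
class MetricStore:
    """Stand-in for the data store; keeps the latest value per metric name."""

    def __init__(self):
        self.values = {}

    def write(self, name, value):
        self.values[name] = value


class ShortLivedJob:
    """A service that does its work and exits before anyone can scrape it."""

    def __init__(self, store=None):
        self.store = store  # set only when the job pushes its own metrics
        self.items_processed = 0
        self.alive = True

    def run(self, items):
        for _ in items:
            self.items_processed += 1
        if self.store is not None:
            # Push method: report just before exiting.
            self.store.write("items_processed", self.items_processed)
        self.alive = False


def scrape(job, store):
    """Pull method: an agent reads the job, which works only while it is alive."""
    if not job.alive:
        return False
    store.write("items_processed", job.items_processed)
    return True


# Pull: the job has already finished by the time the agent comes around.
pull_store = MetricStore()
pulled_job = ShortLivedJob()
pulled_job.run(range(3))
scrape_succeeded = scrape(pulled_job, pull_store)

# Push: the job reports its own metrics before it exits.
push_store = MetricStore()
pushed_job = ShortLivedJob(store=push_store)
pushed_job.run(range(3))

print(scrape_succeeded, pull_store.values, push_store.values)
```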

Where to Store Metrics

To decide on the kind of storage to use for your metrics, it’s important to establish the nature of the data:

  1. The data you’re collecting is immutable. Take as an example the metric: duration for each request on the endpoints of your API. You most likely will never update any of these records.
  2. Since you will collect this data at a constant interval, you can safely assume that most of the dataset will already be sorted (by time!)
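
These two properties are what time series stores tend to exploit. As a minimal sketch, assuming samples always arrive in time order, an append-only structure can answer range queries with a binary search instead of a full scan. The `TimeSeries` class below is hypothetical and far simpler than a real time series database.

```python
import bisect


class TimeSeries:
    """Append-only series of (timestamp, value) samples kept in time order."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def append(self, timestamp, value):
        # Samples are never updated; new ones may only be added at the end.
        if self._timestamps and timestamp < self._timestamps[-1]:
            raise ValueError("samples must arrive in time order")
        self._timestamps.append(timestamp)
        self._values.append(value)

    def between(self, start, end):
        """Values with start <= timestamp < end, located by binary search."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_left(self._timestamps, end)
        return self._values[lo:hi]


# Made-up request durations collected every 15 seconds.
durations = TimeSeries()
for timestamp, seconds in [(0, 0.12), (15, 0.10), (30, 0.31), (45, 0.09)]:
    durations.append(timestamp, seconds)

window = durations.between(15, 45)
print(window)
```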

Kubernetes built-in Endpoint

Kubernetes comes with a built-in endpoint for metrics which is supported by Prometheus. When you deploy the Weave Cloud probes on your cluster, they automatically gather all the metrics from Kubernetes, and later on you will be able to query and graph them using Weave Monitor.

Before you Begin

What You Will Use

Requirements

Sign up for a Google Cloud account

Creating a Google Cloud account is pretty straightforward. Once your account is created, make sure that you enable billing.

Create a Kubernetes cluster

You will use the hosted Kubernetes version that Google Cloud offers, as it is the easiest way to get a cluster up and running.

1. Log in to your Google Cloud account and find the Container Engine section.

Click Create a container cluster and follow the instructions.

2. When asked about the details of the cluster:

  • Pick a good name for your cluster
  • Select a Zone that is close to your physical location
  • We recommend a cluster of at least 7.5GB and at least 2 vCPUs for deploying The Sock Shop and the rest of the Weave probes.
  • Since this cluster is only for testing purposes, we recommend disabling
    • Automatic upgrades
    • Automatic repair

After filling in the details, click the Create button.

3. Wait for your cluster to become available. This operation might take between 5 and 10 minutes.

4. Once your cluster is ready, click on its Connect button. This brings up a dialog with the Google Cloud CLI configuration command.

5. Authenticate with Google Cloud from the cloud terminal:

`gcloud auth login`

The output should be something like this:

```
Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?redirect_uri=...


    Enter verification code:
```

In your browser, open the link that was provided by the previous command and follow the instructions.

6. Get the credentials required by kubectl to authenticate with your Kubernetes cluster:

```
gcloud container clusters get-credentials <cluster-name> \
             --zone <zone> --project <project-id>
```

Ensure that you replace the cluster-name, zone and project-id placeholders with the values that Google Cloud gives you.

The output should be as follows:

```
Fetching cluster endpoint and auth data.
kubeconfig entry generated for cluster-1.
```

7. Verify that you can talk to the cluster:

```
kubectl cluster-info
```

The output should yield something like this:

```
Kubernetes master is running at https://1.2.3.4
GLBCDefaultBackend is running at https://1.2.3.4/api/v1/proxy/namespaces/kube-system/services/default-http-backend
Heapster is running at https://1.2.3.4/api/v1/proxy/namespaces/kube-system/services/heapster
KubeDNS is running at https://1.2.3.4/api/v1/proxy/namespaces/kube-system/services/kube-dns
kubernetes-dashboard is running at https://1.2.3.4/api/v1/proxy/namespaces/kube-system/services/kubernetes-dashboard

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
```

Sign Up for Weave Cloud & Connect the Agents to Kubernetes

Before you can use Weave Monitor to monitor apps, you’ll need to:

1. Sign up for Weave Cloud.

2. From the setup screens that appear, choose Platform Kubernetes –> Environment GKE and then copy the command.

3. Paste this command into your GKE terminal.

After a few moments your cluster should be successfully connected to Weave Cloud.

Note: Weave Cloud requires elevated permissions which your user account will not be able to grant without additional configuration. These permissions are already set up in the command shown to you on the Weave Cloud setup screens.

For more information, see Cluster Role Binding.

Monitoring with Weave Cloud

At this point, the Weave Cloud agents should have been deployed to your Kubernetes cluster. Weave Monitor automatically collects several metrics from your cluster, including:

  • Kubernetes built-in metrics
  • Any metrics generated by your services as long as your services expose metrics on the standard /metrics endpoint
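
For a service to be picked up this way, it has to answer `GET /metrics` with samples in the Prometheus text exposition format. The sketch below does so using only the Python standard library. The metric name `http_requests_total`, its labels, the hard-coded counts and port 8080 are assumptions for illustration; the official Prometheus client libraries normally handle the formatting for you.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical counters; a real service would update these as it serves traffic.
REQUEST_COUNTS = {("front-end", "200"): 42, ("front-end", "500"): 3}


def render_metrics(request_counts):
    """Render {(service, status): count} in the Prometheus text format."""
    lines = [
        "# HELP http_requests_total Requests served, by service and status code.",
        "# TYPE http_requests_total counter",
    ]
    for (service, status), count in sorted(request_counts.items()):
        lines.append(
            f'http_requests_total{{service="{service}",status="{status}"}} {count}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counters on /metrics and 404s everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


print(render_metrics(REQUEST_COUNTS), end="")
```

Calling `serve()` starts the endpoint; it is left uncalled here so that the snippet can be run without blocking.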

To visualize these metrics:

1. Go to your Weave Cloud instance and click Monitor. You will be presented with a list of preconfigured System Notebooks.

2. Check the Node Resources notebook:

The Kubernetes notebook:

And the Weave Net notebook:

Notice that there are no metrics for this since GKE clusters use their own container networking plugin.

Deploy ‘The Sock Shop’ to Kubernetes

Deploy the microservices reference application, The Sock Shop, along with a load test service to generate some metrics that later on you will be able to visualize in Weave Cloud.

To install The Sock Shop, run the following in the Google terminal:

```
git clone https://github.com/microservices-demo/microservices-demo microservices-demo
cd microservices-demo
kubectl create namespace sock-shop
kubectl apply -f deploy/kubernetes/manifests
```

It may take a few minutes before the application is completely ready and generating metrics that can be collected and displayed.

Go to Weave Cloud, click Monitor and create a new notebook for the Sock Shop. (Notebooks are collections of queries for a particular service, application or even incident. They can be shared with your colleagues as well. See Understanding Prometheus Notebooks for more information on how to use them.)

The following screen capture displays a sample query that shows the request rate for each of the services in the Shop, both for HTTP 200 status codes as well as for HTTP 500 errors.

Conclusions & Next Steps

Up to here you have a source of truth to gauge how your system is behaving in production, but this is only part of your job. You probably also need to attend meetings, write some code, do some server maintenance, eat, sleep, etc. For these scenarios you need an alerting system that can look into the data from your metrics, analyse it and react to certain criteria. For example, if in the past 5 minutes your front end application has been returning more HTTP 500 status codes than HTTP 200, then you probably want this alerting system to notify you about it.

For this we have written the Configuring Alerts, Routes and Receivers with the RED Method tutorial.
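
As a rough illustration of the kind of rule such an alerting system evaluates, the sketch below checks the example condition of more HTTP 500 than HTTP 200 responses in the past 5 minutes. The `should_alert` function and the sample data are hypothetical; in Weave Cloud the equivalent rule would be expressed as a query over your metrics rather than in application code.

```python
def should_alert(samples, now, window_seconds=300):
    """True if, within the window, HTTP 500 responses outnumber HTTP 200 ones.

    samples is an iterable of (timestamp, status_code) pairs, in seconds.
    """
    cutoff = now - window_seconds
    ok = errors = 0
    for timestamp, status in samples:
        if timestamp < cutoff or timestamp > now:
            continue  # outside the evaluation window
        if status == 200:
            ok += 1
        elif status == 500:
            errors += 1
    return errors > ok


# Made-up responses from the front end: (seconds, status code).
responses = [(10, 500), (700, 200), (720, 500), (800, 500), (850, 200), (880, 500)]

alert_at_900 = should_alert(responses, now=900)
alert_at_860 = should_alert(responses, now=860)
print(alert_at_900, alert_at_860)
```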

Join the Weave Community

If you have any questions or comments you can reach out to us on our Slack channel. To invite yourself to the Community Slack channel, visit Weave Community Slack invite or contact us through one of these other channels at Help and Support Services.