The tutorials Monitoring Microservices With Weave Cloud and Integrating Grafana with Weave Cloud provide great examples of how to specify and visualize metrics, particularly with the RED method (and its limitations), but you also need to react when those metrics indicate changes in your infrastructure or when your app is in distress. In this tutorial you will learn how to set up alerts, routes, and receivers for your mission-critical metrics.

The following topics are discussed:

  • An introduction to Weave Cloud alerts
  • PromQL queries with the RED method
  • Specifying routes, receivers, and alerting rules
  • Testing and triggering your alerts

An Introduction to Weave Cloud Alerts

Weave Cloud Monitor offers clear benefits over self-hosted Prometheus, such as horizontal scalability and near-infinite data retention. It also removes the need to run your own Alertmanager. If you maintain your own Alertmanager and it goes down for some reason, you won't receive any notifications. And if you're using the open-source versions of Prometheus and Alertmanager, you must restart the services every time you update your configuration, which makes adjusting your alerts a tedious process. ConfigMaps also need manual updates, and restarting pods can be risky, since any errors in the configuration files may prevent Prometheus or the Alertmanager from launching.

Prometheus monitoring with Weave Cloud Monitor solves these problems: alerts, routes, and receivers can be adjusted on the fly and tweaked as your needs change and as your infrastructure scales. In the Weave Cloud admin panel, you can easily add lengthy, robust alerting rules without having to tear anything down.

Before You Begin

This tutorial assumes you already have a Kubernetes cluster set up with Weave Cloud, and that the Sock Shop is running in it.

See Monitoring Microservices With Weave Cloud, or use Setting up a Kubernetes Cluster on Digital Ocean if you'd like to try setting up a cluster with kubeadm.

Requirements

Note: For your convenience, the Sock Shop has already been instrumented using the available Prometheus client libraries. For more information on how to do that, see Instrumenting Your Code.

PromQL Queries with the RED Method

You will create some simple PromQL queries to represent the RED Method metrics:

  • (Request) Rate – the number of requests per second that your services are serving
  • (Request) Errors – the number of failed requests per second
  • (Request) Duration – the amount of time each request takes, expressed as a time interval

(Request) Rate

The Request Rate tracks how many requests are received for each service.


sum(
  rate(
    request_duration_seconds_count{
      job=~"^sock-shop.*",status_code=~"2..",route!="metrics"
    }[1m]
  )
) by (name,instance,job)

This query retrieves all jobs whose names begin with "sock-shop" and whose status_code is in the 200 range. It excludes the /metrics route, which is the endpoint that Prometheus scrapes for metrics.
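
If the query returns no data, it helps to inspect the underlying series first. As a quick sanity check (a sketch using the same metric as above), run the bare selector and confirm that the name, instance, job, status_code, and route labels look the way the filters expect:

request_duration_seconds_count{job=~"^sock-shop.*"}

Running this as a table shows every label combination being scraped, which makes it easier to see why a filter such as route!="metrics" matches or doesn't.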

Verifying the Query in Weave Cloud with PromQL

To validate the results against your own expectations, you can run the query directly in Weave Cloud:

1. Within your Weave Cloud instance, click Monitor, where you will see the following:

2. Weave Cloud allows you to create notebooks for collections of metrics and queries that you use frequently. Type "RED Metrics" as the name for this notebook and click Save.

3. Now you can enter your PromQL query directly into the field and click Table or Graph to display the data.

If you run it as a table, the results will look something like this:

You now have a Rate query saved to your RED Metrics notebook. Let's add some more queries to it.

(Request) Errors

In the field labeled "2", enter the next PromQL query, which tracks errors.


sum(
  rate(
    request_duration_seconds_count{
      job=~"^sock-shop.*",status_code=~"4..|5..",route!="metrics"
    }[1m]
  )
) by (name,instance,job,status_code)

This query returns the summed rate of all errors from the Sock Shop. It includes every status code in the 400 or 500 range and again excludes the /metrics route (the endpoint used to scrape metrics).

Try adding this query to your "RED Metrics" notebook in Weave Cloud and then Save the notebook again.
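
The raw error rate is useful for spotting absolute spikes, but you may also want to see errors as a fraction of total traffic. Here is a minimal sketch, assuming the same metric and labels as above, that divides the error rate by the overall request rate:

sum(
  rate(
    request_duration_seconds_count{
      job=~"^sock-shop.*",status_code=~"4..|5..",route!="metrics"
    }[1m]
  )
) by (name,instance,job)
/
sum(
  rate(
    request_duration_seconds_count{
      job=~"^sock-shop.*",route!="metrics"
    }[1m]
  )
) by (name,instance,job)

A value of 0.05 means 5% of requests to that service are failing, which is often easier to reason about than an absolute count.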

(Request) Duration

Finally there are the Duration metrics.

histogram_quantile(0.95,
  sum(
    rate(request_duration_seconds_bucket{job=~"^sock-shop.*"}[1m])
  ) by (name,instance,job,le)
)

This query uses histogram_quantile because request durations are recorded as a histogram of buckets: summing the per-bucket rates and passing them to histogram_quantile yields the request duration at a chosen percentile. It is set to the 95th percentile so that oddball outliers don't trigger false positives.

Add this query to your notebook as well and then save the notebook again.
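
To see how far the tail diverges from typical requests, you can run a variant of the same query with a different quantile; this sketch uses the median (0.50):

histogram_quantile(0.50,
  sum(
    rate(request_duration_seconds_bucket{job=~"^sock-shop.*"}[1m])
  ) by (name,instance,job,le)
)

Graphing the 0.50 and 0.95 quantiles side by side makes it easy to spot services whose tail latency drifts away from their typical latency.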

You should now have a notebook with all three RED metrics. As a reference, it is invaluable for diagnosing why your alerts may be firing at 2 a.m.

Some Terms and Definitions for Creating Alerts

Before you begin setting up the Alerts, it’s important to understand a few common terms:

  • Alerts fire when a threshold defined in a PromQL query is met
  • Routes are rules that determine how alerts are triaged as they fire
  • Receivers are configurations for the tools that send alert notifications

Specifying Routes and Receivers

Before specifying any alerts you need to set up a Route and Receiver. The following is a list of supported receivers:

  • Email
  • HipChat
  • PagerDuty
  • Pushover
  • Slack
  • OpsGenie
  • VictorOps
  • Webhook

Because it is universally available, you will use RequestBin for a webhook receiver in this tutorial. However, RequestBin is not meant for production systems. Get a URL by visiting https://requestb.in.

Routes and Receivers follow the same conventions as the standard Prometheus Alertmanager configuration. From your Weave Cloud instance, click the settings cog icon in the header and then click Configure -> Configure Alerting Receivers.

Enter a simple route and receiver. This route is root level, which means that it applies to all alerts; however, you can specify more complicated configurations based on any of the data coming through the Alertmanager.

# Alertmanager configuration
route:
  group_by: ['cluster']             # group alerts that share the same cluster label
  receiver: 'requestbin_receiver'   # default receiver for all alerts
  group_wait: 0s                    # send notifications immediately

receivers:
  - name: 'requestbin_receiver'
    webhook_configs:
      - url: 'ENTER_REQUESTBIN_URL_HERE'   # paste your RequestBin URL here

This is YAML, so indentation matters. Press Save, and you have configured your first route and receiver.
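
As your needs grow, you can nest sub-routes under this root route. The following sketch follows standard Alertmanager conventions but is not required for this tutorial; the pagerduty_receiver name and its service key are hypothetical placeholders. It sends alerts labeled severity: critical to PagerDuty while everything else falls through to the webhook:

route:
  group_by: ['cluster']
  receiver: 'requestbin_receiver'    # default receiver for anything not matched below
  group_wait: 0s
  routes:
    - match:
        severity: critical           # only alerts carrying this label
      receiver: 'pagerduty_receiver'

receivers:
  - name: 'requestbin_receiver'
    webhook_configs:
      - url: 'ENTER_REQUESTBIN_URL_HERE'
  - name: 'pagerduty_receiver'
    pagerduty_configs:
      - service_key: 'ENTER_PAGERDUTY_SERVICE_KEY_HERE'

Note that the alerts defined later in this tutorial carry severity = "warning", so with this configuration they would still be delivered to the RequestBin webhook.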

Specifying Alerting Rules

Now you can set up simple alerts based on the PromQL queries that you defined earlier. Click the settings cog icon in the header and then Configure -> Alerting Rules.

Alert rules in Weave Cloud are configured exactly the same way as open-source Prometheus alerting rules.

Either put all of your alert configurations together or load them separately as individual configurations. Every time a new rule configuration is saved, you are presented with a new field for the next one.

Your rules should look as follows:


# Request alert
ALERT HighRequestRate
IF (sum(rate(request_duration_seconds_count{job=~"^sock-shop.*",status_code=~"2..",route!="metrics"}[1m])) by (name,instance,job)) > 10
FOR 10s
LABELS { severity = "warning" }
ANNOTATIONS {
  summary = "Job {{ $labels.job }} has high requests",
  description = "{{ $labels.instance }} of job {{ $labels.job }} has a high rate of requests.",
}

# Error alert
ALERT HighErrorRate
IF (sum(rate(request_duration_seconds_count{job=~"^sock-shop.*",status_code=~"4..|5..",route!="metrics"}[1m])) by (name,instance,job,status_code)) > 10
FOR 10s
LABELS { severity = "warning" }
ANNOTATIONS {
  summary = "Job {{ $labels.job }} has high errors",
  description = "{{ $labels.instance }} of job {{ $labels.job }} has a high error rate.",
}

# Duration alert
ALERT LongDuration
IF histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{job=~"^sock-shop.*"}[1m])) by (name,instance,job,le)) > 1
FOR 10s
LABELS { severity = "warning" }
ANNOTATIONS {
  summary = "Job {{ $labels.job }} has long duration",
  description = "{{ $labels.instance }} of job {{ $labels.job }} has a long duration.",
}

Notice how the thresholds that determine when an alert fires are expressed using Prometheus comparison operators. You can also see how templating is used to generate a summary and description for the messages that will be sent.
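
As a variation, you might prefer to alert on the proportion of failing requests rather than on an absolute error rate. The following is a sketch in the same rule syntax and is not part of the tutorial's rule set; the HighErrorRatio name and the 0.1 threshold (10% errors) are arbitrary examples:

# Error ratio alert (sketch)
ALERT HighErrorRatio
IF (sum(rate(request_duration_seconds_count{job=~"^sock-shop.*",status_code=~"4..|5..",route!="metrics"}[1m])) by (name,instance,job)) / (sum(rate(request_duration_seconds_count{job=~"^sock-shop.*",route!="metrics"}[1m])) by (name,instance,job)) > 0.1
FOR 1m
LABELS { severity = "warning" }
ANNOTATIONS {
  summary = "Job {{ $labels.job }} has a high error ratio",
  description = "{{ $labels.instance }} of job {{ $labels.job }} is returning more than 10% errors.",
}

A ratio-based rule stays meaningful as traffic grows, whereas a fixed per-second threshold needs periodic retuning.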

After you’ve entered all of the rules (either individually or together) and saved them, you are ready to test out rule triggering.

Testing Your Alerts

You need two tools to test the alerts. This example uses Siege to generate traffic, but you can use any load-generating tool. In this case, we are running on Ubuntu, so you can install it with:


$ sudo apt-get install siege
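
If you're not on Ubuntu, Siege is packaged for most platforms; for example, on macOS (assuming you have Homebrew installed) you could run:

$ brew install siege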

You will also make use of the Scope Traffic Control Plugin, which allows you to simulate latency for your requests.

To install it to your Kubernetes cluster, run:

$ kubectl create -f https://raw.githubusercontent.com/weaveworks-plugins/scope-traffic-control/master/deployments/k8s-traffic-control.yaml
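
Before moving on, you can check that the plugin's pods started; the exact pod names depend on how the DaemonSet is named, so the grep below is just a loose filter:

$ kubectl get pods --all-namespaces | grep traffic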

Port Forward

Since you are targeting a service in the Sock Shop, you need to find a pod that you can point your load generator at.

$ kubectl get pods -n sock-shop
NAME                            READY     STATUS    RESTARTS   AGE
carts-3827086002-49f5l          1/1       Running   0          4m
carts-db-3114877618-ktlnl       1/1       Running   0          4m
catalogue-1468368541-0f8dm      1/1       Running   0          4m
catalogue-db-4178470543-7lpdd   1/1       Running   0          4m
front-end-370121486-7q2q8       0/1       Running   0          4m
orders-2403447817-nhd85         1/1       Running   0          4m
orders-db-98190230-k41b0        1/1       Running   0          4m
payment-3234301047-k8w5j        1/1       Running   0          4m
queue-master-3447951779-pkmnz   1/1       Running   0          4m
rabbitmq-3917772209-h21nx       1/1       Running   0          4m
session-db-97809841-br010       1/1       Running   0          4m
shipping-2367010433-hwhd7       1/1       Running   0          4m
user-3469057369-52n30           1/1       Running   0          4m
user-db-90590358-kwsj9          1/1       Running   0          4m

In this case, target the user service: port-forward to the user pod and background the process:


$ kubectl port-forward user-3469057369-52n30 8000:80 -n sock-shop &
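
Before generating any load, you can confirm that the port-forward is working by hitting the same /health endpoint used in the tests below (your pod name will differ):

$ curl -i http://localhost:8000/health

You should see an HTTP 200 response.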

Reviewing the Alerts

While the alerts are triggering, visit your RequestBin inspect URL, for example:


https://requestb.in/rcrz0prc?inspect

You can also reload the Firing Alerts page within Monitor in Weave Cloud.


You can also view your results by visiting the notebook you created earlier.

Triggering the Alerts

Rate

Let's trigger a high request rate:

$ siege --concurrent=60 --reps=100 http://localhost:8000/health

The command above hits a valid endpoint with 60 concurrent users, each making 100 requests. You should get a 200 result for each request.

Review your results as described above. It may take a few seconds for your results to manifest.
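
You can also watch the alert change state directly in your notebook by querying Prometheus' built-in ALERTS series (assuming it is exposed in your Weave Cloud instance):

ALERTS{alertname="HighRequestRate"}

The alertstate label moves from pending to firing once the FOR duration has elapsed.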

Errors

Let's trigger a high error rate:

$ siege --concurrent=60 --reps=100 http://localhost:8000/not_an_endpoint

Here you are hitting an invalid endpoint with 60 concurrent users. You should get a response in the 400 range for each request.

Review your results again.

Duration

Your last test, for duration, requires you to visit Weave Cloud again. Click Explore, then Containers, and then click the user container. If you've successfully deployed the Scope Traffic Control Plugin, you should see some new icons on the container:

The hourglass icons add levels of latency to requests. Click the leftmost hourglass for maximum latency. Now hit a valid endpoint again, but with low concurrency, because you don't want to trigger the rate or error alerts:


$ siege --concurrent=10 --reps=100 http://localhost:8000/health

Review your results.

Conclusions and Next Steps

Now that you have set up a baseline PromQL notebook and added some basic alerts, you can begin altering and testing your queries right within Weave Cloud to find the metrics that reflect your use case. See Integrating Grafana with Weave Cloud for even more dashboards.

Join the Weave Community

If you have any questions or comments you can reach out to us on our Slack channel. To invite yourself to the Community Slack channel, visit Weave Community Slack invite or contact us through one of these other channels at Help and Support Services.