We have covered the RED Method at length in The RED Method: Key Metrics For Microservices Architecture, and the tutorial on Weave Cloud Monitor and Grafana offers a great example of how to visualize those metrics. But we also want to react when those metrics indicate changes in your infrastructure. For that we will use Alerts in Weave Cloud Monitor.

Prometheus Alerts: An Overview

Weave Cloud Monitor offers clear benefits over self-hosted Prometheus, such as horizontal scalability and near-infinite data retention. It is also not necessary to run an Alertmanager. If you maintain your own Alertmanager and it goes down for some reason, you will no longer receive any notifications. You can review all the benefits by reading up on Prometheus Monitoring in Weave Cloud.

If you’re using the open source version of Prometheus and Alertmanager, you must reload or restart the services every time you update your configuration. This makes adjusting your alerts a tedious process. ConfigMaps need manual updates, and restarting pods can be risky since any errors in the configuration files may prevent Prometheus or the Alertmanager from launching.
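
To illustrate, a typical self-hosted workflow on Kubernetes looks roughly like this (a sketch only; the ConfigMap name, label, and namespace below are placeholders and will differ in your cluster):

# Edit the rules/config stored in a ConfigMap (hypothetical names)
$ kubectl edit configmap prometheus-config -n monitoring

# Wait for the updated ConfigMap to propagate, then bounce the pods
$ kubectl delete pod -l app=prometheus -n monitoring

# If the new configuration has a syntax error, the pod may crash-loop
$ kubectl get pods -n monitoring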

Prometheus Monitoring with Weave Cloud Monitor solves this problem. Alerts, routes, and receivers can be adjusted on the fly and tweaked as your needs change or your infrastructure scales. In the Weave Cloud admin panel, you can easily add lengthy and robust alerting rules, and then apply them by pressing “Save.”

Setting Up Alerts

In this scenario we will assume you already have a Kubernetes cluster set up with Weave Cloud and the Sock Shop running, and we will just walk through the basics of setting up Alerts. You can read more about how to Set Up a Kubernetes Cluster.

Requirements

The Sock Shop has the instrumented code in place according to the guide on Instrumenting Your Code.

The RED Method with PromQL

These are the three core metrics of the RED Method, which we will alert on:

  • (Request) Rate – the number of requests per second your services are serving

  • (Request) Errors – the number of failed requests per second

  • (Request) Duration – the amount of time each request takes, expressed as a time interval

You will create some simple PromQL queries to represent these metrics.

(Request) Rate

The Request Rate tracks how many requests are received for each service.


sum(
  rate(
    request_duration_seconds_count{
      job=~"^sock-shop.*",status_code=~"2..",route!="metrics"
    }[1m]
  )
) by (name,instance,job)

In the above query we are looking for all jobs beginning with “sock-shop” and a status_code in the 200s, but excluding the /metrics route, which is the endpoint for Prometheus to scrape metrics.
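
If you want to break the rate down further, you can add more labels to the by() clause. For example, a per-route breakdown using the same metric and labels as above (an extra sketch, not part of the original notebook) looks like this:

sum(
  rate(
    request_duration_seconds_count{
      job=~"^sock-shop.*",status_code=~"2..",route!="metrics"
    }[1m]
  )
) by (name,instance,job,route)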

You should run it directly in Weave Cloud to validate your expectations. Within your Weave Cloud instance click on “Monitor” and you will see the following screen.

You may notice a “Save” button in the top right hand corner of the screen. Weave Cloud allows you to create notebooks for metrics you commonly use, so name the notebook and save it: change the name “New Notebook” to “RED Metrics” and click “Save.”

After creating a new notebook you can enter your PromQL directly into the field and press table or graph to return the data.

The results will look something like this if you run it as a table.

So now we have a Rate Query.

(Request) Errors

In the field labeled “2” we will enter our next PromQL for errors.


sum(
  rate(
    request_duration_seconds_count{
      job=~"^sock-shop.*",status_code=~"4..|5..",route!="metrics"
    }[1m]
  )
) by
(name,instance,job,status_code)

Once again we have a sum of all errors from sock-shop, with any error code in the 400 or 500 range where route does not match /metrics (metrics is the endpoint that is hit to scrape the metrics). “Save” the notebook again.
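
In practice it is often useful to alert on the error ratio rather than the absolute error rate, so that the threshold scales with traffic. A sketch using the same metric (an addition for illustration, not one of the three notebook queries):

sum(
  rate(
    request_duration_seconds_count{
      job=~"^sock-shop.*",status_code=~"4..|5..",route!="metrics"
    }[1m]
  )
) by (name,instance,job)
/
sum(
  rate(
    request_duration_seconds_count{
      job=~"^sock-shop.*",route!="metrics"
    }[1m]
  )
) by (name,instance,job)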

(Request) Duration

Finally we have the Duration metrics.

histogram_quantile(0.95,
  sum(rate(request_duration_seconds_bucket{job=~"^sock-shop.*"}[1m]))
  by
  (name,instance,job,le)
)

Here we use histogram_quantile to take the 95th percentile of the summed rate of the request duration histogram buckets. Because we alert on the 95th percentile rather than the maximum, oddball outliers will not trigger false positives.

Add this to field number “3” and save the notebook again.
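
If you later want to compare against other percentiles, only the first argument changes. For example, the median (an extra example, not one of the three notebook queries):

histogram_quantile(0.50,
  sum(rate(request_duration_seconds_bucket{job=~"^sock-shop.*"}[1m]))
  by (name,instance,job,le)
)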

Now we have a notebook with all three RED metrics for reference, which is valuable for diagnosing why our alerts may be firing.

Some Terms and Definitions for Creating Alerts

Before you begin setting up the Alerts, it’s important to understand some common terms. There are three core concepts:

Alerts are fired when a threshold is met within a PromQL query.

Routes are rules on how to triage alerts as they are fired.

Receivers are configurations for tools to send alert notifications.

Routes and Receivers

First, set up a route and receiver before specifying the alerts.

This is the current list of available receivers:

  • Email
  • Hipchat
  • Pagerduty
  • Pushover
  • Slack
  • Opsgenie
  • Victorops
  • Webhook

In this example, you will use requestb.in as a webhook receiver because it is universally available; however, it is not meant for production systems. You can get a URL just by visiting https://requestb.in.
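
Before wiring it into Weave Cloud, you can confirm the bin accepts POSTs with a quick curl (the bin ID below is a placeholder; use the one RequestBin gives you, and reload the ?inspect page to see the request appear):

$ curl -X POST -H "Content-Type: application/json" \
    -d '{"test": "hello from curl"}' \
    https://requestb.in/YOUR_BIN_ID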

The Routes and Receivers follow the same conventions as the standard Alertmanager configuration. In your Weave Cloud instance, under Monitor, click “Configure Alert Routes and Receivers.”


Then enter a simple route and receiver. This route is root level, which means it will apply to all alerts; however, you can specify more complicated configurations based on any of the data coming through the Alertmanager.


# AlertManager
route:
  group_by: ['cluster']
  receiver: 'requestbin_receiver'
  group_wait: 0s

receivers:
  - name: 'requestbin_receiver'
    webhook_configs: 
      - url: 'ENTER_REQUESTBIN_URL_HERE'

This is YAML, so indentation is important. After pressing “Save,” you’ve just configured your first route and receiver.
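
If you later want different behavior per alert, you can nest child routes under the root route. A rough sketch (the 'pagerduty_receiver' here is hypothetical and would need its own entry under receivers):

route:
  group_by: ['cluster']
  receiver: 'requestbin_receiver'
  group_wait: 0s
  routes:
    # Alerts labeled severity=critical go to a different receiver
    - match:
        severity: 'critical'
      receiver: 'pagerduty_receiver'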

Alerts

Now you will set up simple Alerts based on the PromQL you defined earlier. Click on “Define Alerting and Recording Rules.”


Alert rules in Weave Cloud are configured exactly the same way as alert rules in the open source version of Prometheus.

Either put all of your alert rules together in one configuration or load them separately as individual configurations. Every time a rule configuration is saved, you will be presented with a new field for the next one.

Your rules should look like this:


# Request alert
ALERT HighRequestRate
  IF (sum(rate(request_duration_seconds_count{job=~"^sock-shop.*",status_code=~"2..",route!="metrics"}[1m])) by (name,instance,job)) > 10
  FOR 10s
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Job {{ $labels.job }} has high requests",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has a high rate of requests."
  }

# Error Alert
ALERT HighErrorRate
  IF (sum(rate(request_duration_seconds_count{job=~"^sock-shop.*",status_code=~"4..|5..",route!="metrics"}[1m])) by (name,instance,job,status_code)) > 10
  FOR 10s
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Job {{ $labels.job }} has high errors",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has a high error rate."
  }

# Duration Alert
ALERT LongDuration
  IF histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{job=~"^sock-shop.*"}[1m])) by (name,instance,job,le)) > 1
  FOR 10s
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Job {{ $labels.job }} has long duration",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has a long duration."
  }

Notice that the thresholds for when alerts should fire are expressed using Prometheus comparison operators. You can also see how the summary and description for the notifications we will send are generated with templating on the alert labels.

After you’ve entered all of the rules (either individually or together) and saved them, you are ready to test out rule triggering.

Testing Your Alerts

You need two tools to test alerts. In this example we use Siege to generate traffic, but you can use any load-generating tool. In this case we are running on Ubuntu, so you can install it with:


$ sudo apt-get install siege

You will also use the Scope Traffic Control Plugin, which allows you to add latency to your requests.

To install it to your Kubernetes cluster, just run:

$ kubectl create -f https://raw.githubusercontent.com/weaveworks-plugins/scope-traffic-control/master/deployments/k8s-traffic-control.yaml
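You can confirm the plugin’s pods came up with a quick check (a rough sketch; the exact pod names and namespace depend on the manifest):

$ kubectl get pods --all-namespaces | grep traffic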

Port Forward

You will be targeting a service in the Sock Shop, so first get a pod that you can hit with your load generator.

$ kubectl get pods -n sock-shop
NAME                            READY     STATUS    RESTARTS   AGE
carts-3827086002-49f5l          1/1       Running   0          4m
carts-db-3114877618-ktlnl       1/1       Running   0          4m
catalogue-1468368541-0f8dm      1/1       Running   0          4m
catalogue-db-4178470543-7lpdd   1/1       Running   0          4m
front-end-370121486-7q2q8       0/1       Running   0          4m
orders-2403447817-nhd85         1/1       Running   0          4m
orders-db-98190230-k41b0        1/1       Running   0          4m
payment-3234301047-k8w5j        1/1       Running   0          4m
queue-master-3447951779-pkmnz   1/1       Running   0          4m
rabbitmq-3917772209-h21nx       1/1       Running   0          4m
session-db-97809841-br010       1/1       Running   0          4m
shipping-2367010433-hwhd7       1/1       Running   0          4m
user-3469057369-52n30           1/1       Running   0          4m
user-db-90590358-kwsj9          1/1       Running   0          4m

In this case, you will target the user service, so you will port forward to the user service pod and background the process:


$ kubectl port-forward user-3469057369-52n30 8000:80 -n sock-shop&
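
With the port-forward in place, a quick request confirms the service is reachable before you start generating load (the /health endpoint is the same one Siege will hit below):

$ curl -s http://localhost:8000/health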

Reviewing the Alerts

While triggering the alerts, you will visit your RequestBin inspect URL, for example:


https://requestb.in/rcrz0prc?inspect
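
Each firing alert arrives at the bin as a JSON POST from the webhook integration. Assuming the standard Alertmanager webhook format (the exact fields and version string can vary by Alertmanager version), the payload is roughly shaped like this, with placeholders in angle brackets:

{
  "version": "4",
  "status": "firing",
  "receiver": "requestbin_receiver",
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "HighRequestRate", "severity": "warning", "job": "<job>" },
      "annotations": { "summary": "<summary>", "description": "<description>" },
      "startsAt": "<timestamp>",
      "endsAt": "<timestamp>"
    }
  ]
}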

You can also reload the “Firing Alerts” page within Weave Cloud.


You can also view your results by visiting the notebook you created earlier.

Triggering the Alerts

Rate

Let’s first trigger a high request rate:

$ siege --concurrent=60 --reps=100 http://localhost:8000/health

Here you are hitting a valid endpoint with 60 concurrent users making 100 requests each. You should get a 200 response for each request.

Review your results as described above. It may take a few seconds for your results to manifest.

Errors

Let’s trigger a high error rate:

$ siege --concurrent=60 --reps=100 http://localhost:8000/not_an_endpoint

Here you are hitting an invalid endpoint with 60 concurrent users. You should get a response in the 400 range for each request.

Review your results again.

Duration

Your last test, for duration, requires you to visit Weave Cloud again. Click on “Explore,” choose “Containers,” and then click on the “user” container. If you successfully deployed the Scope Traffic Control Plugin, you should see some new icons on the container:

The hourglass icons add levels of latency to requests. Click on the leftmost hourglass for maximum latency. Now generate traffic against a valid endpoint again, this time with low concurrency, because you don’t want to trigger the rate or error alerts.


$ siege --concurrent=10 --reps=100 http://localhost:8000/health

Review your results.

Conclusion

Now that you have set up a baseline PromQL notebook and some basic Alerts you can begin altering and testing your queries right within Weave Cloud to find the metrics that reflect your use case.