The tutorials Monitoring Microservices With Weave Cloud and Integrating Grafana with Weave Cloud show how to specify and visualize metrics, particularly with the RED method and its limitations. But you also need to react when those metrics indicate changes in your infrastructure or when your app is in distress. In this tutorial you will learn how to set up alerts, routes and receivers for your mission-critical metrics.
The following topics are discussed:
- An Introduction to Weave Cloud Alerts
- PromQL Queries with the RED Method
- Some Terms and Definitions for Creating Alerts
- Testing Your Alerts
- Triggering the Alerts
- Conclusions and Next Steps
An Introduction to Weave Cloud Alerts
Weave Cloud Monitor offers clear benefits over self-hosted Prometheus, such as horizontal scalability and near-infinite data retention. You also don't need to run your own Alertmanager: if you maintain one yourself and it goes down, you won't receive any notifications. With the open source versions of Prometheus and Alertmanager, you must restart the services every time you update your configuration, which makes adjusting your alerts a tedious process. ConfigMaps also need manual updates, and restarting pods can be risky since any errors in the configuration files may prevent Prometheus or the Alertmanager from launching.
Prometheus Monitoring with Weave Cloud Monitor solves this problem: alerts, routes, and receivers can be adjusted on the fly and tweaked as your needs change and as your infrastructure scales. In the Weave Cloud admin panel, you can easily add lengthy and robust alerting rules without having to tear anything down.
Before You Begin
This tutorial assumes you already have a Kubernetes cluster set up with Weave Cloud, and that the Sock Shop is running in it.
See Monitoring Microservices With Weave Cloud, or use Setting up a Kubernetes Cluster on Digital Ocean if you'd like to try setting up a cluster with kubeadm.
Requirements
Note: For your convenience, the Sock Shop has already been instrumented using the available Prometheus client libraries. For more information on how to do that, see Instrumenting Your Code.
PromQL Queries with the RED Method
You will create some simple PromQL queries to represent the RED Method metrics:
- (Request) Rate – the number of requests, per second, your services are serving
- (Request) Errors – the number of failed requests per second
- (Request) Duration – the amount of time each request takes, expressed as a time interval
(Request) Rate
The Request Rate tracks how many requests are received for each service.
sum(
  rate(
    request_duration_seconds_count{
      job=~"^sock-shop.*",status_code=~"2..",route!="metrics"
    }[1m]
  )
) by (name,instance,job)
This query retrieves all jobs beginning with "sock-shop" whose status_code is in the 200s. It excludes the /metrics route, which is the endpoint Prometheus scrapes for metrics.
Verifying the Query in Weave Cloud with PromQL
To check that the query returns what you expect, you can run it directly in Weave Cloud:
1. Within your Weave Cloud instance click Monitor where you will see the following:
2. Weave Cloud allows you to create notebooks for collections of metrics and queries that you commonly use. Type "RED Metrics" as the notebook name and click Save.
3. Now enter your PromQL directly into the query field and click Table or Graph to return the data.
If you run it as a table, the results will look something like this:
You now have a Rate query saved to your RED Metrics notebook. Let's add some more queries to it.
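Before moving on, note that you can also break the same rate down per route to see which endpoints are receiving the traffic. This variant is optional and isn't used anywhere later in the tutorial; it simply adds the route label to the grouping of the query above:
sum(
  rate(
    request_duration_seconds_count{
      job=~"^sock-shop.*",status_code=~"2..",route!="metrics"
    }[1m]
  )
) by (name,job,route)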
(Request) Errors
In the field labeled "2", enter the next PromQL query, this time for errors.
sum(
  rate(
    request_duration_seconds_count{
      job=~"^sock-shop.*",status_code=~"4..|5..",route!="metrics"
    }[1m]
  )
) by (name,instance,job,status_code)
This query returns the rate of all errors from the Sock Shop, covering status codes in the 400 and 500 ranges, and again excludes the /metrics route (the endpoint used to scrape metrics).
Try adding this query to your "RED Metrics" notebook in Weave Cloud and then Save the notebook again.
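Raw error counts can be hard to judge on their own, so a common refinement is to look at the proportion of requests that fail instead. The query below is an optional sketch that divides the error rate by the total request rate; it isn't needed for the rest of this tutorial and reuses the same metric and labels as the queries above.
sum(
  rate(request_duration_seconds_count{job=~"^sock-shop.*",status_code=~"4..|5..",route!="metrics"}[1m])
) by (name,instance,job)
/
sum(
  rate(request_duration_seconds_count{job=~"^sock-shop.*",route!="metrics"}[1m])
) by (name,instance,job)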
(Request) Duration
Finally, there is the Duration metric.
histogram_quantile(0.95,
  sum(
    rate(request_duration_seconds_bucket{job=~"^sock-shop.*"}[1m])
  ) by (name,instance,job,le)
)
This query uses histogram_quantile because request duration is recorded as a histogram of request times. It computes the 95th percentile, so that the occasional outlier doesn't trigger false positives.
Add this query to your notebook as well and then save the notebook again.
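The 0.95 in the query above is just the quantile. If you want to watch a different part of the latency distribution, you can swap in another value; the optional variant below looks at the median (50th percentile) instead. It isn't used in the alerts later on.
histogram_quantile(0.50,
  sum(
    rate(request_duration_seconds_bucket{job=~"^sock-shop.*"}[1m])
  ) by (name,instance,job,le)
)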
You should now have a notebook with all three RED metrics. As a reference, it is invaluable for diagnosing why your alerts are firing at 2 am.
Some Terms and Definitions for Creating Alerts
Before you begin setting up the Alerts, it’s important to understand a few common terms:
- Alerts are fired when a threshold is met within a PromQL query
- Routes are rules on how to triage alerts as they are fired
- Receivers are configurations for the tools that alert notifications are sent to
Specifying Routes and Receivers
Before specifying any alerts you need to set up a Route and Receiver. The following is a list of supported receivers:
- HipChat
- PagerDuty
- Pushover
- Slack
- OpsGenie
- VictorOps
- Webhook
Because it is universally available, this tutorial uses requestb.in as a webhook receiver. Note, however, that requestb.in is not meant for production systems. Get a URL by visiting https://requestb.in.
Routes and Receivers follow the same conventions as the standard Prometheus Alertmanager configuration. From your Weave Cloud instance, click the settings cog icon in the header and then click Configure -> Configure Alerting Receivers.
Enter a simple route and receiver. This route is at the root level, which means it applies to all alerts; however, you can specify more complicated configurations based on any of the data coming through the Alertmanager (a sketch of one follows the basic example below).
# AlertManager
route:
  group_by: ['cluster']
  receiver: 'requestbin_receiver'
  group_wait: 0s
receivers:
  - name: 'requestbin_receiver'
    webhook_configs:
      - url: 'ENTER_REQUESTBIN_URL_HERE'
This is YAML, so spacing is important. After pressing Save, you have configured your first route and receiver.
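As a sketch of what a more complicated configuration might look like, the routing tree below sends any alert labelled severity="critical" to a separate receiver, while everything else still falls through to the webhook. It follows the standard Alertmanager routing conventions, but the pagerduty_receiver name and service key are hypothetical placeholders and this configuration is not required for the rest of the tutorial.
# AlertManager (sketch only -- the PagerDuty receiver name and key are placeholders)
route:
  group_by: ['cluster']
  receiver: 'requestbin_receiver'
  group_wait: 0s
  routes:
    # Alerts labelled severity="critical" are routed to PagerDuty instead
    - match:
        severity: 'critical'
      receiver: 'pagerduty_receiver'
receivers:
  - name: 'requestbin_receiver'
    webhook_configs:
      - url: 'ENTER_REQUESTBIN_URL_HERE'
  - name: 'pagerduty_receiver'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'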
Specifying Alerting Rules
Now you can set up simple alerts based on the PromQL queries that you defined earlier. Click on the settings cog icon in the header and then Config -> Alerting rules.
Alert rules in Weave Cloud are configured exactly the same way as alert rules in the open source version of Prometheus.
You can either put all of your alert rules in one configuration or load them separately as individual configurations. Every time you save a rule configuration, a new field appears for entering another one.
Your rules should look as follows:
# Request alert
ALERT HighRequestRate
  IF (sum(rate(request_duration_seconds_count{job=~"^sock-shop.*",status_code=~"2..",route!="metrics"}[1m])) by (name,instance,job)) > 10
  FOR 10s
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Job {{ $labels.job }} has high requests",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has a high rate of requests.",
  }

# Error Alert
ALERT HighErrorRate
  IF (sum(rate(request_duration_seconds_count{job=~"^sock-shop.*",status_code=~"4..|5..",route!="metrics"}[1m])) by (name,instance,job,status_code)) > 10
  FOR 10s
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Job {{ $labels.job }} has high errors",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has a high error rate.",
  }

# Duration Alert
ALERT LongDuration
  IF histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{job=~"^sock-shop.*"}[1m])) by (name,instance,job,le)) > 1
  FOR 10s
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Job {{ $labels.job }} has long duration",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has a long duration.",
  }
Notice how the thresholds that determine when alarms fire are expressed with Prometheus comparison operators (more than 10 requests or errors per second, durations over 1 second). You can also see how templating is used to generate the summary and description for the messages that will be sent.
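If you set up a severity-based routing tree like the sketch shown earlier, you could pair it with a second rule at a higher threshold. The rule below is only an illustration; the alert name, the 50 requests-per-second threshold and the "critical" severity are arbitrary choices, and it is not needed for the tests that follow.
# Critical request alert (sketch: threshold and severity chosen for illustration)
ALERT VeryHighRequestRate
  IF (sum(rate(request_duration_seconds_count{job=~"^sock-shop.*",status_code=~"2..",route!="metrics"}[1m])) by (name,instance,job)) > 50
  FOR 10s
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Job {{ $labels.job }} has a very high request rate",
    description = "{{ $labels.instance }} of job {{ $labels.job }} is serving more than 50 requests per second.",
  }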
After you’ve entered all of the rules (either individually or together) and saved them, you are ready to test out rule triggering.
Testing Your Alerts
You need two tools to test the alerts. This example uses Siege to generate traffic, but you can use any load-generating tool. On Ubuntu you can install it with:
$ sudo apt-get install siege
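If you happen to be on macOS instead of Ubuntu, Siege is also available through Homebrew (assuming Homebrew is installed):
$ brew install siege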
You will also make use of the Scope Traffic Control Plugin, which allows you to simulate latency for your requests.
To install it into your Kubernetes cluster, run:
$ kubectl create -f https://raw.githubusercontent.com/weaveworks-plugins/scope-traffic-control/master/deployments/k8s-traffic-control.yaml
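Optionally, you can check that the plugin's pods came up before moving on. The exact pod names depend on the manifest, so the grep pattern below is only a guess:
$ kubectl get pods --all-namespaces | grep traffic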
Port Forward
Since you are targeting a service in the Sock Shop, you first need to find a pod to point your load generator at:
$ kubectl get pods -n sock-shop
NAME READY STATUS RESTARTS AGE
carts-3827086002-49f5l 1/1 Running 0 4m
carts-db-3114877618-ktlnl 1/1 Running 0 4m
catalogue-1468368541-0f8dm 1/1 Running 0 4m
catalogue-db-4178470543-7lpdd 1/1 Running 0 4m
front-end-370121486-7q2q8 0/1 Running 0 4m
orders-2403447817-nhd85 1/1 Running 0 4m
orders-db-98190230-k41b0 1/1 Running 0 4m
payment-3234301047-k8w5j 1/1 Running 0 4m
queue-master-3447951779-pkmnz 1/1 Running 0 4m
rabbitmq-3917772209-h21nx 1/1 Running 0 4m
session-db-97809841-br010 1/1 Running 0 4m
shipping-2367010433-hwhd7 1/1 Running 0 4m
user-3469057369-52n30 1/1 Running 0 4m
user-db-90590358-kwsj9 1/1 Running 0 4m
In this case, target the user service: port-forward to its pod and background the process:
$ kubectl port-forward user-3469057369-52n30 8000:80 -n sock-shop&
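Before generating any load, you can check that the port-forward is working with a single request to the /health endpoint used in the tests below (assuming the user service exposes it, as the Sock Shop demo does):
$ curl http://localhost:8000/health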
Reviewing the Alerts
While the alerts are firing, visit your RequestBin inspect URL, for example:
https://requestb.in/rcrz0prc?inspect
You can also reload the "Firing Alerts" page within Monitor in Weave Cloud.
Note: Red highlighting added for emphasis
You can also view your results by visiting the notebook you created earlier.
Triggering the Alerts
Rate
Let's trigger a high request rate:
$ siege --concurrent=60 --reps=100 http://localhost:8000/health
The command above hits a valid endpoint with 60 concurrent users, each making 100 requests. You should get a 200 response for each request.
Review your results as described above. It may take a few seconds for your results to manifest.
Errors
Let's trigger a high error rate:
$ siege --concurrent=60 --reps=100 http://localhost:8000/not_an_endpoint
Here you are hitting an endpoint that doesn't exist with 60 concurrent users. You should get a 404 response for each request, which falls within the 4xx range matched by the error query.
Review your results again.
Duration
Your last test, for duration, requires you to visit Weave Cloud again. Click Explore, then Containers, and then click on the user container. If you've successfully deployed the Scope Traffic Control Plugin, you should see some new icons on the container:
The hourglass icons add increasing levels of latency to requests. Click on the leftmost hourglass for maximum latency. Now hit a valid endpoint again, this time with low concurrency so that you don't trigger the rate or error alerts.
$ siege --concurrent=10 --reps=100 http://localhost:8000/health
Review your results.
Conclusions and Next Steps
Now that you have set up a baseline PromQL notebook and added some basic Alerts you can begin altering and testing your queries right within Weave Cloud to find the metrics that reflect your use case. See Integrating Grafana with Weave Cloud for even more dashboards.
Join the Weave Community
If you have any questions or comments you can reach out to us on our Slack channel. To invite yourself to the Community Slack channel, visit Weave Community Slack invite or contact us through one of these other channels at Help and Support Services.