If you use Prometheus, then you probably use Grafana. At Weave, we have Grafana dashboards for all of our microservices. When we want to understand our system, our Grafana dashboards are the first things we look at.

To make the most out of Grafana, you must put your dashboards and configuration in version control. Once that’s done, you can treat them like any other code or configuration. You can review changes, test those changes with CI, continuously deploy them with the CD tool of your choice, and, if necessary, roll those changes back. Once your Grafana dashboards are in a Git repository, everything just becomes simpler.

Grafana doesn’t make it easy to do this. But here’s how.

  1. Build & deploy your own custom Grafana container
  2. Use gfdatasource to point it at your Prometheus
  3. Use grafanalib for easy-to-use, reproducible dashboards

1. Build a Grafana Container

Make a directory in a Git repository for keeping all of your Grafana configuration. Add a Dockerfile that looks like:


FROM grafana/grafana

COPY grafana-defaults.ini grafana.ini /etc/grafana/
COPY *.json /grafana/dashboards/

ENTRYPOINT ["/usr/sbin/grafana-server", "--homepath=/usr/share/grafana", "--config=/etc/grafana/grafana.ini"]

And make a grafana.ini that meets your needs. It must at least have:


[dashboards.json]
enabled = true
path = /grafana/dashboards

See the Grafana documentation to learn more.

2. Point Grafana at Prometheus

You could follow the instructions in the Prometheus documentation and go to the Grafana UI to configure it, but then you would have to do that each and every time you deploy a new Grafana image, and we want to continuously deploy all the things. So, we’re going to cheat.

Run:


docker run weaveworks/gfdatasource:latest \
          --grafana-url=http://grafana.monitoring.svc.cluster.local:80/api \
          --data-source-url=http://prometheus.monitoring.svc.cluster.local/ \
          --name="My Little Prometheus" \
          --type=prometheus \
          --update-interval=10

This will run forever, and every ten seconds will tell the Grafana at http://grafana.monitoring.svc.cluster.local:80 that there is a Prometheus data source called “My Little Prometheus” at http://prometheus.monitoring.svc.cluster.local/.
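To make the idea concrete, here is a minimal sketch of the kind of request a tool like this has to send. It is not gfdatasource's actual implementation: it only assumes Grafana's HTTP API for creating data sources (a POST to /api/datasources), and the function name build_datasource_request is made up for illustration. Authentication and the repeat-every-ten-seconds loop are left out.

```python
import json
import urllib.request


def build_datasource_request(grafana_api_url, name, ds_type, ds_url):
    """Build (but do not send) a request that registers a data source.

    Hypothetical helper for illustration; not part of gfdatasource.
    """
    payload = {
        "name": name,
        "type": ds_type,
        "url": ds_url,
        # "proxy" asks the Grafana backend to forward queries to the source.
        "access": "proxy",
    }
    return urllib.request.Request(
        url=grafana_api_url.rstrip("/") + "/datasources",
        data=json.dumps(payload, sort_keys=True).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_datasource_request(
    "http://grafana.monitoring.svc.cluster.local:80/api",
    "My Little Prometheus",
    "prometheus",
    "http://prometheus.monitoring.svc.cluster.local/",
)
# Sending it would be urllib.request.urlopen(request), with credentials
# added, repeated on whatever interval you choose.
```

Re-sending the same definition on a timer is what makes a freshly deployed Grafana container pick up its data source without anyone touching the UI.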

In our cluster, we run gfdatasource as a sidecar in our grafana pods, but you don’t have to do it like that.

3. Use grafanalib

At this point, you can start putting your dashboards into version control. Go to a Grafana dashboard, click the Share icon, choose Export, and then “Save to file”. Your browser will save a JSON file that you can then move into your Git repository.

Unfortunately, what you’ll have is thousands of lines of mostly meaningless JSON with lots of duplication. This makes code review a pain, and makes it hard to keep your dashboards consistent. For example, we wanted:

  • sorted keys in JSON objects, to reduce diff size
  • unique graph IDs, otherwise Grafana would break
  • all stacked graphs to be 0-based, with the tooltip showing individual rather than cumulative values
  • successful requests in green and errors in red
  • one pair of “RED method” graphs per row, with the number of queries on the left and the latency on the right

Many of these were discussed in our previous post on Designing Effective Dashboards.

You can get around this for a while using custom lint scripts that look at the JSON and tell you if you have got anything wrong. That’s what we did at first. It’s not ideal, but you can manage.
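A lint script of this kind can be quite small. The following is a sketch rather than the script we actually ran: it checks only two of the rules above (sorted keys and unique graph IDs), the name lint_dashboard is hypothetical, and it assumes the exported JSON uses a top-level "rows" list whose entries each hold a "panels" list.

```python
import json


def lint_dashboard(text):
    """Return a list of human-readable problems found in dashboard JSON."""
    problems = []

    def check_key_order(pairs):
        # Called once per JSON object, with keys in the order they appear.
        keys = [key for key, _ in pairs]
        if keys != sorted(keys):
            problems.append("unsorted keys: %s" % ", ".join(keys))
        return dict(pairs)

    dashboard = json.loads(text, object_pairs_hook=check_key_order)

    # Every panel in every row must have an ID that nothing else uses.
    seen = set()
    for row in dashboard.get("rows", []):
        for panel in row.get("panels", []):
            panel_id = panel.get("id")
            if panel_id in seen:
                problems.append("duplicate panel id: %r" % (panel_id,))
            seen.add(panel_id)
    return problems


clean = '{"rows": [{"panels": [{"id": 1}, {"id": 2}]}], "title": "ok"}'
clash = '{"rows": [{"panels": [{"id": 1}, {"id": 1}]}], "title": "bad"}'
jumbled = '{"title": "x", "rows": []}'
```

Here lint_dashboard(clean) returns an empty list, while the other two inputs each produce one complaint. Run in CI, a non-empty result would fail the build.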

But if you switch to grafanalib, everything becomes wonderful.

Here’s what our dashboard definition for Scope looks like now:


import grafanalib.core as G  # General Grafana objects
import grafanalib.weave as W  # Weaveworks-specific customization

dashboard = G.Dashboard(
  title="Scope > Services",
  rows=[
    scope_row('Collection', 'scope/collection'),
    scope_row('Query', 'scope/query'),
    scope_row('Control', 'scope/control'),
    scope_row('Demo', 'extra/demo'),
  ],
)

This is a Python module in a file called scope-services.dashboard.py. It defines a single special variable, dashboard, which the gen-dashboard script in grafanalib evaluates to produce a JSON Grafana dashboard.
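If the "evaluate a file, read one variable" step sounds mysterious, the mechanism is simple. The sketch below shows the general idea using only the standard library. It is not grafanalib's own code: render_dashboard is a hypothetical name, and a plain dict stands in for the G.Dashboard object, which needs grafanalib's own JSON encoding.

```python
import json
import runpy


def render_dashboard(path):
    """Execute a *.dashboard.py file and return its dashboard as JSON text.

    Hypothetical helper; assumes the dashboard value is JSON-serialisable.
    """
    namespace = runpy.run_path(path)  # runs the file, returns its globals
    if "dashboard" not in namespace:
        raise KeyError("%s does not define a 'dashboard' variable" % path)
    # sort_keys keeps the output stable, so diffs stay small.
    return json.dumps(namespace["dashboard"], sort_keys=True, indent=2)
```

Because the file is ordinary Python, it can import helpers, loop over services, and share functions such as scope_row between dashboards.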

It means that all of the graphs for the core Scope services look the same. What do they look like?

scope_row is defined as:


import itertools
GRAPH_ID = itertools.count(1)

def scope_row(name, job):
  return G.Row(
    panels=[
      scope_qps_graph(name, job, next(GRAPH_ID)),
      scope_latency_graph(name, job, next(GRAPH_ID)),
    ]
  )

Each row has a “QPS” graph, which shows queries per second broken down by response code, and a latency graph, which shows the median and 99th percentile latency in milliseconds. Here’s how the latency graph is defined:


def scope_latency_graph(name, job, id):
  return W.PromGraph(
    title='%s Latency' % (name,),
    dataSource="My Little Prometheus",
    expressions=[
      ('99th quantile',
       'job:scope_request_duration_seconds:99quantile{job="%s"} * 1e3' % (job,)),
      ('50th quantile',
       'job:scope_request_duration_seconds:50quantile{job="%s"} * 1e3' % (job,)),
    ],
    id=id,
    yAxes=[
      G.YAxis(format=G.MILLISECONDS_FORMAT),
      G.YAxis(format=G.SHORT_FORMAT),
    ],
  )


PromGraph is a graph that assumes all of its metrics are Prometheus expressions. Note that you pass it the name of the data source that you configured with gfdatasource, so it fetches from your Prometheus.

Once you’ve migrated your dashboards to use grafanalib, update your build process to run gen-dashboard before it builds the Docker image for Grafana that you set up in Step 1 earlier. At Weaveworks, we use a Makefile for this, but you can use whatever works for your team.
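If you would rather not use a Makefile, the same step can be sketched in Python. Treat this as an illustration under assumptions: the function names are hypothetical, the tool name defaults to gen-dashboard as used in this post, and the -o output flag should be checked against the version of grafanalib you have installed.

```python
import pathlib
import subprocess


def dashboard_commands(source_dir, output_dir, tool="gen-dashboard"):
    """Return one command per *.dashboard.py file, without running anything."""
    commands = []
    for source in sorted(pathlib.Path(source_dir).glob("*.dashboard.py")):
        # scope-services.dashboard.py -> scope-services.json
        name = source.name[: -len(".dashboard.py")] + ".json"
        target = pathlib.Path(output_dir) / name
        # Assumed invocation: <tool> -o <output.json> <input.dashboard.py>
        commands.append([tool, "-o", str(target), str(source)])
    return commands


def build_dashboards(source_dir, output_dir):
    """Generate every dashboard, stopping at the first failure."""
    for command in dashboard_commands(source_dir, output_dir):
        subprocess.run(command, check=True)
```

Point output_dir at the directory that holds your Dockerfile, so that the COPY *.json line from Step 1 picks up the generated files.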

The big advantage of grafanalib is that it makes it really easy to have consistent, powerful dashboards. The downside is that you can’t design a dashboard in Grafana’s UI and then export it as a grafanalib definition. So far, the Weave Cloud team has found this to be a happy trade-off.

Conclusion

Now your Grafana configuration is managed entirely from a source control repository, which means you can do code review, CI, CD, and rollbacks. Using grafanalib, you can build consistent, powerful dashboards that can easily extend to new services.