We like continuous integration and testing at Weaveworks. This blog post tells the story of the journey we have been through, the tools and technologies we use, what worked and what didn’t, and the lessons we’ve learned.

A few months back, Weave testing consisted of some Go unit tests and a set of integration tests made of a couple of hundred lines of shell scripts (including the excellent assert.sh), Vagrant files, and glue. These tests were run manually by the developers working on the project. We started by setting up the ever-popular TravisCI to automatically run the unit tests on every PR and push. For fast, iterative testing on Travis you want to use their recently released Docker support; however, this prevents you from building Docker images yourself. We therefore left Travis to just build the code, run the unit tests, and upload the results to Coveralls.io.

To build the Weave Docker images, test them, and run the existing integration tests, we started using CircleCI. CircleCI combines low spin-up time with the ability to build Docker images (even if their version of Docker is a little old). We use the CircleCI test environment as a coordinator for our distributed integration tests. Each run on CircleCI starts a set of VMs on Google Compute Engine, uploads the build to these VMs, and runs a series of integration tests against them. We chose GCE for its per-minute billing, which lets us spin up short-lived VMs without being charged for a full hour.

Originally all this ran in a single VM on CircleCI, but our integration tests were taking longer and longer. CircleCI has a sharding feature, where you can run your build and test script concurrently on multiple VMs (“shards”). CircleCI will automatically balance tests across these shards for common test frameworks, but unfortunately our home-grown, cobbled-together bash scripts didn’t fall into this category.

We started with a simple round-robin approach to scheduling integration tests across shards, but as some tests take minutes and some take seconds, this led to massively unequal shard run times. The Scheduler was born.

The Scheduler
The scheduler is a Python app that lives on Google AppEngine. It has a simple REST API – one endpoint for recording test run time, and one endpoint for getting a schedule. It uses a naive greedy algorithm to assign tests to shards:

  • sort tests by run time (longest first)
  • for each test, pick the shard with the shortest aggregate runtime and add the test
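
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the scheduler's actual code: the function name `schedule`, the shape of its input (a dict of test name to last recorded run time in seconds) and the tie-breaking rule are all assumptions made for the example.

```python
import heapq


def schedule(test_times, num_shards):
    """Greedily assign tests to shards, longest test first.

    test_times: dict of test name -> last recorded run time in seconds
                (hypothetical input shape, chosen for this sketch).
    Returns a list with one list of test names per shard.
    """
    shards = [[] for _ in range(num_shards)]
    # Min-heap of (aggregate runtime, shard index): the root is always
    # the shard with the shortest aggregate runtime so far.
    heap = [(0.0, i) for i in range(num_shards)]
    heapq.heapify(heap)
    # Longest first; ties are broken by name so the plan is deterministic.
    ordered = sorted(test_times.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, seconds in ordered:
        total, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (total + seconds, idx))
    return shards
```

For example, with run times of 300, 120, 100, 30 and 20 seconds and two shards, this puts the 300-second test alone on one shard and the other four (270 seconds in total) on the other, whereas round-robin would give totals of 420 and 150 seconds.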

Our CircleCI setup is configured to call out to the scheduler on each shard and retrieve a list of tests to run on that shard. The scheduler has turned out to be very effective at balancing shard runtimes: our shortest shard usually takes about 2 minutes, and our longest about 2 minutes 30 seconds. It has also proved pretty flexible: we now use the same code to schedule our unit tests across the shards.

We needed a single place to calculate our schedule atomically, so we decided to run this code on AppEngine. Letting each shard decide for itself which tests to run would not work: shards might have slightly out-of-sync information and produce different plans, so some tests could end up running on multiple shards and others on none.

The next challenge was one of coverage. Go has excellent tools for generating coverage reports from unit tests, but getting the same report from the integration tests seemed impossible. We wrote a very short shim, which launches the Weave router inside a unit test, ensures the command line argument handling is set up correctly, and ensures that the coverage report is dumped on exit. Combining this with some plumbing to gather the reports from the test VMs / containers, we now had a large set of coverage reports, and no way to visualise them.

Some posts suggested we could just concatenate the coverage reports together, but this only works for non-overlapping coverage reports. To combine multiple coverage reports which touch the same codebase, we wrote a little tool. This parses the reports and intelligently merges them to produce coherent output. With this tool we were able to combine all the integration test coverage reports, and the unit test coverage report, into a single report for all our testing.
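
To illustrate why concatenation is not enough, here is a rough Python sketch of merging overlapping reports. It is not the tool we wrote; it assumes the standard Go cover profile text format (a `mode:` line followed by lines of the form `file.go:startLine.startCol,endLine.endCol numStmts count`), and it assumes every report was produced from the same source revision so that blocks line up exactly. The function name `merge_profiles` and the file paths in the usage note are hypothetical.

```python
def merge_profiles(profiles):
    """Merge the text of several Go cover profiles into one profile.

    profiles: iterable of strings, each the full text of one profile.
    Counts for identical blocks are summed; in "set" mode the result
    is clamped to 1, so a block counts as covered if any report hit it.
    """
    mode = None
    blocks = {}  # (block position, statement count) -> merged hit count
    for text in profiles:
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith("mode:"):
                this_mode = line.split(":", 1)[1].strip()
                if mode is None:
                    mode = this_mode
                elif mode != this_mode:
                    raise ValueError("cannot merge profiles with different modes")
                continue
            # Split from the right: the file path may itself contain spaces.
            position, num_stmts, count = line.rsplit(" ", 2)
            key = (position, int(num_stmts))
            blocks[key] = blocks.get(key, 0) + int(count)
    if mode is None:
        raise ValueError("no 'mode:' line found")
    out = ["mode: " + mode]
    # Sorting is only there to make the output deterministic.
    for (position, num_stmts), count in sorted(blocks.items()):
        if mode == "set":
            count = min(count, 1)
        out.append("%s %d %d" % (position, num_stmts, count))
    return "\n".join(out) + "\n"
```

Given a unit test report that misses a block and an integration test report that hits it, the merged profile contains that block once, marked as covered, instead of two conflicting lines for the same block.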

Once we successfully gathered code coverage from the integration tests, we were struck by how effective these tests were at covering our codebase. We have roughly 1.4k lines of shell scripts making up the integration tests, which provide around 70% code coverage for our 12k line codebase. By contrast, we have roughly 5k lines of unit tests providing about 54% coverage. With our integration tests, less than a third as much test code provides about 1.3x the coverage. The combined coverage of both sets of tests is around 83%.

The integration tests exercise functionality at the same level as a Weave user or application. They thus succinctly capture what users care about, and they are very stable as a result: the codebase can undergo substantial, behaviour-preserving changes without impacting the tests. However, the integration tests aren’t a panacea:

  • they take longer to run than the unit tests (about 10 mins total compared with about 1 min)
  • they require 3 VMs to run, something that won’t fit on many of our developers’ MacBook Airs
  • failures aren’t particularly isolated; a single bug in one module can cause all the tests to fail, without really pointing to the specific cause

For these reasons we continue to invest in unit tests.

Testing is one of those things you can throw an infinite amount of time at, and yet there always seems to be more to do. In no particular order, here are some things we want to investigate further:

  • Most integration tests only use 1 VM, but have 3 anyway. We could run multiple single-VM tests concurrently with a slightly cleverer scheduler.
  • Getting aggregate (package level, file level & function level) coverage statistics is hard. We want to make a nicer report for this.
  • Performance testing: we want to regularly run longer stress tests to catch performance regressions.
  • Flaky tests: it would be nice to run tests in a tight loop on idle resources to catch them.