With the proliferation of new services being added to Weave Cloud, I thought now would be a good time to write down some of the best practices we’ve learned from running Weave Cloud for over a year.


These are not hard-and-fast rules, but practices we’ve learned. I’ve tried to include a justification as to why. They are open for discussion and review. They should evolve as we learn more. In no particular order:

Naming and organization

Not the most interesting of topics, but naming always seems to elicit a lot of discussion. I don’t actually care how things are named, but you should name things consistently:

What? Name code repos, container images, namespaces and services consistently. Using a common prefix is always a good idea. For example:

Cortex is made up of 6 different services. All these services live in the cortex namespace, and almost* all are built from source in the github.com/weaveworks/cortex.git repo. All the container images share the common prefix quay.io/weaveworks/cortex- and then are named after the service, eg quay.io/weaveworks/cortex-distributor or quay.io/weaveworks/cortex-ingester.
* some images, such as memcache, use upstream “official” images.
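As a rough illustration of the convention, the sketch below derives an image name from the macro-service and service names, and checks whether an image matches the namespace it runs in. The helper names (image_name, follows_convention) are hypothetical and exist only for this example.

```python
# Registry and organisation taken from the Cortex example above.
REGISTRY = "quay.io/weaveworks"


def image_name(macro_service: str, service: str) -> str:
    """Build an image name of the form <registry>/<macro-service>-<service>."""
    return f"{REGISTRY}/{macro_service}-{service}"


def follows_convention(image: str, namespace: str) -> bool:
    """Return True if the image carries the common prefix for its namespace."""
    return image.startswith(f"{REGISTRY}/{namespace}-")
```

Upstream “official” images will fail a check like this by design, so any automated use of it would need an explicit list of exceptions.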

Why? Naming things consistently makes it easier to rely on intuition during incident handling, and in general makes it easier to navigate the many moving parts of Weave Cloud.

Counter-examples: Unfortunately, this rule isn’t very well applied. Images from services.git run in the default namespace. Flux runs in the fluxy namespace. Images which make up services in the monitoring namespace come from all over, including services.git. But we should improve this…

Image hosting

What? We prefer that you host container images on CoreOS’ quay.io. It is quite common to have CI push images to both quay.io and the Docker Hub, but your Kubernetes config should use quay.io.

Why? We’ve had problems pulling from the Docker Hub - nodes in our cluster have been blacklisted in the past. Having images in two repos means we can still pull new images in an emergency. And quay.io’s UI and permissions model are nicer.

Kubernetes organization

What? Put collections of microservices which make up a “macro” service in their own namespace. Macro-services in this case include Scope, Cortex and Flux, along with the services that make up our common PaaS layer (sometimes called “Core” services, in the default namespace) and the monitoring services (in the monitoring namespace).

What? Each Pod should have a single name label (distinct from the Pod name) which identifies the Service that Pod belongs to. Other labels are redundant.

Why? Using a consistent labeling scheme, and dividing related jobs up into individual namespaces, make finding the right job in tools like Prometheus much easier. For example, when we launched Cortex we were accidentally aggregating the Cortex Consul metrics in with the Scope ones.
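The Consul mix-up can be sketched in a few lines. The pod list and the namespace/name key format below are assumptions made for illustration, not a description of how our Prometheus config actually builds job names.

```python
from collections import defaultdict

# Hypothetical pods: Scope and Cortex each run their own Consul.
pods = [
    {"namespace": "scope", "labels": {"name": "consul"}},
    {"namespace": "cortex", "labels": {"name": "consul"}},
    {"namespace": "cortex", "labels": {"name": "distributor"}},
]


def group_pods(pods, use_namespace):
    """Count pods per group, keyed by name alone or by namespace/name."""
    groups = defaultdict(int)
    for pod in pods:
        name = pod["labels"]["name"]
        key = f'{pod["namespace"]}/{name}' if use_namespace else name
        groups[key] += 1
    return dict(groups)


# Keyed by name only, the two Consuls collapse into one group.
by_name = group_pods(pods, use_namespace=False)
# Keyed by namespace/name, each macro-service keeps its own Consul.
by_job = group_pods(pods, use_namespace=True)
```

With the name label alone, both Consul pods land under "consul"; adding the namespace keeps them apart.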

Security and authentication

What? Don’t reinvent the wheel - all external traffic coming into our cluster should come through AuthFE. All traffic should use our existing authentication schemes - token-based instance authentication and cookie-based user authentication.

Why? Work has gone into making sure the traffic served through AuthFE is automatically secure against CSRF, XSS and clickjacking attacks. AuthFE implements HTTPS redirects and HSTS to ensure all traffic coming into our cluster comes via HTTPS. Work has also gone into making sure our authentication schemes are secure and performant. Is it perfect? No. But it’s almost certainly better than anything you can hack up in a few days…

Monitoring

What? All services must expose a Prometheus histogram of request durations. This is what we use to build our common RED-style (request rate, errors, duration) dashboards. This histogram should be named <namespace>_request_duration_seconds.

Why? Consistency of metrics and monitoring between services allows our on-call team (ie you, the developers) to support more and more services by making them all, at a high level, look and taste the same. Special cases are extraordinarily expensive in terms of training and cognitive load.
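To make the shape of such a histogram concrete, here is a toy stand-in written with only the standard library. It is a sketch, not our middleware and not a Prometheus client: the class name and the bucket bounds are made up for this example, and a real service would use a Prometheus client library instead.

```python
import bisect


class RequestDurationHistogram:
    """Toy histogram with cumulative "less than or equal" buckets."""

    def __init__(self, namespace, buckets=(0.1, 0.5, 1.0)):
        # Metric name follows the <namespace>_request_duration_seconds convention.
        self.name = f"{namespace}_request_duration_seconds"
        self.bounds = sorted(buckets)
        # One counter per bound, plus a final slot for the +Inf bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, seconds):
        """Record one request duration, in seconds."""
        # bisect_left finds the first bound that is >= the observation.
        idx = bisect.bisect_left(self.bounds, seconds)
        self.counts[idx] += 1
        self.total += seconds

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, ending with +Inf."""
        running = 0
        out = []
        for bound, n in zip(self.bounds + [float("inf")], self.counts):
            running += n
            out.append((bound, running))
        return out
```

Because every service uses the same metric shape and naming pattern, the same dashboard queries can be pointed at any namespace.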

What? Services should probably use our common middleware for gathering and exposing these metrics.

Why? Developing middleware isn’t as easy as it first seems. There are a bunch of edge-cases that need to be dealt with, including ensuring that the cardinality of labels isn’t too high, and that you correctly detect WebSocket connections and record their latency separately.
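A minimal sketch of two of those edge-cases follows. The route table, label names and helper functions are all hypothetical; the point is only to show the idea of mapping concrete paths onto a small, fixed set of route labels and flagging WebSocket upgrades so they can be recorded separately.

```python
import re

# Hypothetical route table: each pattern maps to one low-cardinality label,
# so /api/foo/items/1 and /api/foo/items/2 share a single time series.
ROUTES = [
    (re.compile(r"^/api/foo/items/[^/]+$"), "api_foo_items_id"),
    (re.compile(r"^/api/foo/items$"), "api_foo_items"),
]


def route_label(path):
    """Map a concrete request path to a bounded set of route labels."""
    for pattern, label in ROUTES:
        if pattern.match(path):
            return label
    # Unknown paths share one label instead of creating a series each.
    return "other"


def is_websocket(headers):
    """Detect a WebSocket upgrade request.

    Assumes headers is a dict with canonically cased keys.
    """
    upgrade = headers.get("Upgrade", "")
    connection = headers.get("Connection", "")
    return upgrade.lower() == "websocket" and "upgrade" in connection.lower()


def metric_labels(method, path, status, headers):
    """Build the label set to attach to a request duration observation."""
    return {
        "method": method,
        "route": route_label(path),
        "status_code": str(status),
        "ws": "true" if is_websocket(headers) else "false",
    }
```

Using the raw path as a label value would create one time series per distinct URL, which is exactly the cardinality problem described above.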

For more best practices on metrics, read Prometheus’ best practice docs and previous posts on metrics and monitoring.

Constructing APIs

What? Use a common prefix for all the endpoints in your service (/api/foo), and include the service name in the path.

Why? Using a common prefix makes the routing rules in AuthFE smaller and easier to reason about. Including the service name in the path makes it easier to connect the dots when debugging.

What? For HTTP, make the API endpoints/paths on the internal service the same as the endpoint/paths you expect the public to use (ie, /api/foo/bar, not just /bar).

Why? Exposing the whole path on your services (as opposed to trimming off a common prefix in AuthFE) allows you to also expose private endpoints (/metrics, /debug/pprof, /traces etc) on the same port without worrying they will be publicly exposed.
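The difference between the two approaches can be sketched as follows. The routing table, backend address and function names are assumptions for illustration and do not reflect AuthFE’s real configuration or code.

```python
# Hypothetical public routing table: path prefix -> backend address.
PUBLIC_ROUTES = {"/api/foo": "foo.default.svc.cluster.local:80"}


def route_full_path(path):
    """Forward the path unchanged; return (backend, path) or None."""
    for prefix, backend in PUBLIC_ROUTES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return backend, path
    return None


def route_strip_prefix(path):
    """Trim the common prefix before forwarding (the approach to avoid)."""
    for prefix, backend in PUBLIC_ROUTES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return backend, path[len(prefix):] or "/"
    return None


# Full paths: the backend sees /api/foo/metrics, which is not its
# private /metrics endpoint.
safe = route_full_path("/api/foo/metrics")
# Stripped prefix: the same public request reaches the backend's /metrics.
leaky = route_strip_prefix("/api/foo/metrics")
```

In the full-path version a public request can never be rewritten into /metrics or /debug/pprof, so those endpoints can share a port with the public API.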

