How do you know when you’re ready to run your Kubernetes cluster in production? In this blog series, we’re going to look at what's typically included in a Production Readiness checklist for your cluster and your app. 

These checklists were put together by Brice Fernandes (@fractallambda), a Weaveworks customer success engineer. If you’re lucky enough to attend one of Brice’s upcoming hands-on workshops, production readiness is a topic he’ll cover in depth.

In part 1 of this blog series, we’ll look at the production readiness checklist for clusters. In part 2, we’ll drill down on what to include in an application checklist.

What is Production Ready?

Production readiness is a term you hear a lot, and depending on who you are talking to and what they are doing, it can mean different things.

“Your offering is production ready when it exceeds customer expectations in a way that allows for business growth.” --Carter Morgan, Developer Advocate, Google.

Cluster production readiness is somewhat dependent on your use case and can involve making tradeoffs. Although a cluster can be considered production ready once it’s good enough to serve traffic, many agree that there is a minimum set of requirements you need to meet before you can safely declare your cluster ready for commercial traffic.

When creating a reliable production set-up, the following areas are important. Some of these topics will be more important than others, depending on your specific use case.

  1. Keep security vulnerabilities and attack surfaces to a minimum.

    Ensure that the applications you are running are secure and that the data you are storing, whether it’s your own or your clients’, is secured against attack. In addition, be aware of security breaches within your own development environments. And because Kubernetes is a rapidly growing open source project, you’ll also need to stay on top of updates and patches so that they can be applied in a timely manner.

  2. Maximize portability and scalability by fine-tuning cluster performance.

    The ability to run and manage applications anywhere is a large part of why Kubernetes is the number one orchestrator, along with its ability to self-heal nodes, autoscale infrastructure and adapt to your expanding business. Most teams want all of the benefits of self-healing and scalability without taking a performance hit.

  3. Implement secure CI/CD pipelines for continuous delivery.

    With your infrastructure in place, an automated continuous deployment and delivery pipeline allows your development team to maximize their velocity and improve productivity through increased release frequency and repeatable deployments.

  4. Apply observability as a deployment catalyst.

    Observability is not only about being able to monitor your system; it’s also about having a high-level view into your running services so that you can make decisions before you deploy. To achieve true observability, you need the processes and tools in place to act on that monitoring through timely incident alerts.

  5. Create a disaster recovery plan.

    Ensure that you have high availability, which means that if you have a failure, your cluster can recover automatically. In the case of a complete cluster meltdown, adopting GitOps best practices means the cluster’s configuration is held in Git and can be used to rebuild it.

The Production Ready Checklist for Clusters

These are the areas that need attention before running your cluster in production.

| Component | What is it | Why you need it | Options |
| --- | --- | --- | --- |
| Build pipeline | Tests, integrates, builds and deposits container artefact to registry. Artefacts should be tagged with Git commit SHA to verify provenance. | Ensures a bug-free artefact before deployment. | CircleCI, Travis, Jenkins and others. |
| Deployment pipeline | Takes the build artefacts, and delivers them to the cluster. This is where GitOps occurs. | More secure way of doing deployment. Can add approval checkpoints if needed. | Weave Cloud, Flux |
| Image registry | Stores build artefacts. Needs credentials for CI to push and for cluster to pull images. | Keeps versioned artefacts available. | Roll your own. Commercial: DockerHub, JFrog, GCP Registry |
| Monitoring infrastructure | Collects and stores metrics. | Understands your running system. Alerts when something goes wrong. | OSS: Prometheus, Cortex, Thanos. Commercial: Datadog, Grafana Cloud, Weave Cloud |
| Shared storage | Stores the persistent state of your application beyond the pod's lifetime. Seen by your app as a directory and can be read-only. | No one ever has a stateless app. | Many. Depends on the platform. |
| Secrets management | How your applications access secret credentials across your application and to and from the cluster. | Secrets are required to access external services. | Bitnami Sealed Secrets, HashiCorp Vault |
| Ingress controller | Provides a common routing point for all inbound traffic. | Easier to manage authentication and logging. | AWS ELB, NGINX, Kong, Traefik, HAProxy, Istio |
| API gateway | Creates a single point for incoming requests and is a higher-level ingress controller that can replace an ingress controller. | Routes at HTTP level. Enables common and centralized tooling for tracing, logging, authentication. | Ambassador (Envoy), roll your own |
| Service mesh | Provides an additional layer on top of Kubernetes to manage routing. | Enables complex use cases like Progressive Delivery. Adds inter-service TLS, load balancing, service discovery, monitoring and tracing. | Linkerd, Istio |
| Service catalogue / Broker | Enables easy dependencies on services and service discovery for your team. | Simplifies deploying applications. | Kubernetes’ service catalog API |
| Network policies | Allows you to create rules on permitted connections and services. Requires a CNI plugin. | Prevents unauthorized access, improves security, segregates namespaces. | Weave Net, Calico |
| Authorization integration | Enables an API-level integration into the Kubernetes auth flow. | Uses existing SSO to reduce the number of accounts and to centralize account management. Can integrate with almost any auth provider. | Requires custom work |
| Image scanning | Automates the scanning of vulnerabilities in your container images. Implemented at the CI stage of your pipeline. | Because CVEs always happen. | Docker, Snyk, Twistlock, Sonatype, Aqua Security |
| Log aggregation | Brings all the logs from your application into one searchable place. | Logs are the best source of information on what went wrong. | Fluentd or the ELK (Elasticsearch, Logstash, Kibana) stack are good bets for rolling your own. |
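
The Build Pipeline entry in the checklist recommends tagging artefacts with the Git commit SHA to verify provenance. As a rough illustration only, here is a minimal Python sketch of how a CI step might build such an image reference. The helper names `current_commit_sha` and `image_reference`, the registry host and the repository name are all hypothetical, and the sketch assumes a repository that uses 40-character SHA-1 commit hashes.

```python
import re
import subprocess

# Assumption: the repository uses SHA-1 object names (40 hex characters).
_FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def current_commit_sha(repo_dir="."):
    """Ask git for the full SHA of HEAD (requires git on PATH)."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def image_reference(registry, repository, commit_sha):
    """Build an image reference whose tag is the full Git commit SHA.

    Rejects anything that is not a full SHA (for example "latest"),
    so every pushed image can be traced back to exactly one commit.
    """
    sha = commit_sha.strip().lower()
    if not _FULL_SHA.match(sha):
        raise ValueError(f"not a full 40-character Git SHA: {commit_sha!r}")
    return f"{registry}/{repository}:{sha}"
```

In a CI job, something like `image_reference("registry.example.com", "team/app", current_commit_sha())` would give the name to build and push. Refusing mutable tags such as `latest` is a design choice in this sketch: it means the image running in the cluster always points back to the exact commit that produced it.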