Cluster Ready Checklist for Kubernetes
How do you know when you’re ready to run your Kubernetes cluster in production? In this blog series, we’re going to look at what's typically included in a Production Readiness checklist for your cluster and your app.
These checklists were put together by Brice Fernandes (@fractallamda), a Weaveworks customer success engineer. If you’re lucky enough to attend an upcoming hands-on workshop led by Brice, production readiness is a topic he’ll dive deep into.
In part 1 of this blog series, we’ll look at production ready checklists for clusters. In part 2, we’ll drill down on what to include on an application checklist.
What is Production Ready?
Production readiness is a term you hear a lot, and depending on who you are talking to and what they are doing, it can mean different things.
“Your offering is production ready when it exceeds customer expectations in a way that allows for business growth.” --Carter Morgan, Developer Advocate, Google.
Cluster production readiness is somewhat dependent on your use case and can be about making tradeoffs. Although a cluster can be production ready when it’s good enough to serve traffic, many agree that there is a minimum set of requirements you need before you can safely declare your cluster ready for commercial traffic.
When creating a reliable production set-up, the following areas are important. Some of these topics will be more important than others, depending on your specific use case.
- Keep security vulnerabilities and attack surfaces to a minimum.
Ensure that the applications you are running are secure and that the data you are storing, whether it's your own or your clients', is secured against attack. In addition to this, be aware of security breaches within your own development environments. And because Kubernetes is a rapidly growing open source project, you’ll also need to be on top of the updates and patches so that they can be applied in a timely manner.
- Maximize portability and scalability by fine-tuning cluster performance.
The ability to run and manage applications anywhere is why Kubernetes is the number one orchestrator, along with its ability to self-heal nodes, autoscale infrastructure and adapt to your expanding business. Most teams want all of the benefits of self-healing and scalability without taking a performance hit.
- Implement secure CI/CD pipelines for continuous delivery.
With your infrastructure in place, an automated continuous deployment and delivery pipeline allows your development team to maximize their velocity and improve productivity through increased release frequency and repeatable deployments.
- Apply observability as a deployment catalyst.
Observability is not only about being able to monitor your system, but it’s also about having a high-level view into your running services so that you can make decisions before you deploy. To achieve true observability you need the processes and tools in place to act on that monitoring through timely incident alerts.
- Create a disaster recovery plan.
Ensure that you have high availability, which means that if you have a failure, your cluster can recover automatically. In the case of complete cluster meltdown with the adoption of GitOps best practices
The Production Ready Checklist for Clusters
These are the areas that need attention before running your cluster in production.
Area | What is it | Why you need it | Options |
---|---|---|---|
Build Pipeline | Tests, integrates, builds and deposits the container artefact in the registry. Artefacts should be tagged with the Git commit SHA to verify provenance. | Helps catch bugs in the artefact before deployment. | CircleCI, Travis, Jenkins and others. |
Deployment pipeline | Takes the build artefacts and delivers them to the cluster. This is where GitOps occurs. | More secure way of doing deployment. Can add approval checkpoints if needed. | Weave Cloud, Flux |
Image Registry | Stores build artefacts. Needs credentials for CI to push and for the cluster to pull images. | Keeps versioned artefacts available. | Roll your own. Commercial: DockerHub, JFrog, GCP Registry |
Monitoring Infrastructure | Collects and stores metrics. | Helps you understand your running system. Alerts when something goes wrong. | OSS: Prometheus, Cortex, Thanos. Commercial: Datadog, Grafana Cloud, Weave Cloud |
Shared Storage | Stores the persistent state of your application beyond the pod's lifetime. Seen by your app as a directory and can be read-only. | No one ever has a stateless app. | Many. Depends on the platform. |
Secrets management | Controls how your applications access secret credentials, both within the application and to and from the cluster. | Secrets are required to access external services. | Bitnami Sealed Secrets, HashiCorp Vault |
Ingress controller | Provides a common routing point for all inbound traffic. | Easier to manage authentication and logging. | AWS ELB, NGINX, Kong, Traefik, HAProxy, Istio |
API Gateway | Creates a single point for incoming requests. It is a higher-level alternative that can replace an ingress controller. | Routes at HTTP level. Enables common and centralized tooling for tracing, logging, authentication. | Ambassador (Envoy), roll your own |
Service Mesh | Provides an additional layer on top of Kubernetes to manage routing. | Enables complex use cases like Progressive Delivery. Adds inter-service TLS, load balancing, service discovery, monitoring and tracing. | Linkerd, Istio |
Service catalogue / Broker | Enables easy dependencies on services and service discovery for your team. | Simplifies deploying applications. | Kubernetes’ service catalog API |
Network policies | Allows you to create rules on permitted connections and services. Requires a CNI plugin. | Prevents unauthorized access, improves security, segregates namespaces. | Weave Net, Calico |
Authorization integration | Enables an API level integration into the Kubernetes auth flow. | Uses existing SSO to reduce the number of accounts and to centralize account management. Can integrate with almost any auth provider. | Requires custom work |
Image scanning | Automates the scanning of vulnerabilities in your container images. Implemented at the CI stage of your pipeline. | Because CVEs always happen. | Docker, Snyk, Twistlock, Sonatype, Aqua Security |
Log aggregation | Brings all the logs from your application into one searchable place. | Logs are the best source of information on what went wrong. | Fluentd or the ELK (Elasticsearch, Logstash, Kibana) stack are good bets for rolling your own. |
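
The Build Pipeline row recommends tagging artefacts with the Git commit SHA so their provenance can be verified. The Python sketch below shows one way a CI step might do that. The function names and the `registry.example.com` and `team/app` values are hypothetical, chosen only for illustration, and it assumes the `git` CLI is available on the build machine.

```python
import re
import subprocess

# A full Git SHA-1 commit ID is 40 lowercase hexadecimal characters.
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def current_commit_sha(repo_dir="."):
    """Return the full SHA of HEAD in repo_dir using `git rev-parse HEAD`."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def image_reference(registry, repository, commit_sha):
    """Build `registry/repository:sha`, rejecting anything but a full SHA."""
    if not FULL_SHA.fullmatch(commit_sha):
        raise ValueError(f"not a full Git commit SHA: {commit_sha!r}")
    return f"{registry}/{repository}:{commit_sha}"


def commit_from_reference(reference):
    """Recover the commit SHA from a reference built by image_reference."""
    # The tag is whatever follows the last colon in the reference.
    tag = reference.rsplit(":", 1)[-1]
    if not FULL_SHA.fullmatch(tag):
        raise ValueError(f"tag is not a Git commit SHA: {tag!r}")
    return tag
```

In a CI job this could be called as `image_reference("registry.example.com", "team/app", current_commit_sha())`, and the resulting string passed to whatever build and push commands your pipeline already uses. Because the tag is the commit itself, `commit_from_reference` lets you trace any image running in the cluster back to the exact source revision, which a mutable tag such as `latest` cannot do.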