Disaster Recovery is not a new concept. In fact, it has been a requirement for critical business systems for almost as long as those systems have existed. In modern distributed systems, teams can encounter two types of failure: unintentional errors and deliberate malicious attacks. Both require human intervention, especially if the distributed system is exposed to multiple errors simultaneously. This article discusses how teams can use GitOps principles to design with recovery in mind and strengthen existing systems.

Distributed systems provide a new challenge for Disaster Recovery

Distributed systems are complex: most modern applications are split into microservices and packaged into containers that are orchestrated with Kubernetes across multiple clouds.

Teams within larger enterprises may require multiple backends, such as different cloud providers, which adds another layer of complexity. The adoption of cloud native tools such as Kubernetes, where multiple applications for multiple departments are managed by a single orchestration layer, requires operators to have different disaster recovery (DR) practices in place. Lastly, application development teams strive for self-service when it comes to Kubernetes and need to iterate on apps quickly through automated deployments and simple rollbacks.

Design for Understandability and Observability

If a critical system error occurs, the understandability of your system is crucial to keeping the incident brief rather than experiencing a protracted disaster. SRE and/or DevOps teams are tasked with understanding the operational behavior of the system and the system’s variants, including security and availability. That way they can actively reduce security vulnerabilities and resilience failures and keep incident response times short and effective.

If teams can observe their system at any point in time and get alerted when the system diverges from the desired state, mitigation can happen before disaster strikes.

GitOps principles state that your entire system is described declaratively and versioned in Git. That means teams have a single source of truth from which everything is derived and driven. GitOps also puts software agents in place that ensure correctness and alert on divergence. Together, these establish a feedback and control loop for system and operational tasks.
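
The divergence check at the heart of that loop can be sketched in a few lines of Python. This is a minimal illustration, not how any particular GitOps agent is implemented: the function name find_drift and the plain dictionaries standing in for the desired state (from Git) and the actual state (from the cluster) are assumptions made for the example.

```python
def find_drift(desired, actual):
    """Compare desired state (as declared in Git) with actual state.

    Both arguments are hypothetical dicts mapping a resource name to its
    spec. Returns a dict mapping each diverging resource to a status.
    """
    drift = {}
    for name in sorted(set(desired) | set(actual)):
        if name not in actual:
            drift[name] = "missing"      # declared in Git, not running
        elif name not in desired:
            drift[name] = "unexpected"   # running, but not declared in Git
        elif desired[name] != actual[name]:
            drift[name] = "changed"      # running with a different spec
    return drift


# Illustrative data only; resource names and images are made up.
desired = {"web": {"image": "web:1.4", "replicas": 3},
           "api": {"image": "api:2.0", "replicas": 2}}
actual = {"web": {"image": "web:1.4", "replicas": 1},
          "debug-shell": {"image": "busybox", "replicas": 1}}

for name, status in find_drift(desired, actual).items():
    print(f"ALERT: {name} is {status}")
```

A real agent would run a check like this continuously and either alert or reconcile the cluster back to the state declared in Git.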

Figure: GitOps principles (gitops-principles.png)


Design for Speed (safely)

To recover a system quickly, teams need a fast rollout and rollback mechanism they can rely on. GitOps pipelines are automated delivery pipelines that roll out changes to the system infrastructure when changes are made to Git. At the same time, teams get a convenient audit log of all changes in Git. Because Git can revert, roll back and fork, rollbacks become stable and reproducible.
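
To illustrate why a Git-based rollback keeps the audit log intact, here is a toy Python model of a commit history. It does not call Git; the functions apply_commit and revert_last and the history list are hypothetical stand-ins that mirror how `git revert` adds a new commit instead of deleting the faulty one.

```python
def apply_commit(history, message, state):
    """Append a new commit; earlier entries are never modified."""
    history.append({"message": message, "state": dict(state)})


def revert_last(history):
    """Undo the latest commit by appending a commit that restores the
    state of the one before it, so the faulty change stays in the log."""
    if len(history) < 2:
        raise ValueError("nothing to revert to")
    bad, previous = history[-1], history[-2]
    apply_commit(history, 'Revert "{}"'.format(bad["message"]), previous["state"])


# Illustrative history only; image tags are made up.
history = []
apply_commit(history, "Deploy web 1.4", {"web": "web:1.4"})
apply_commit(history, "Deploy web 1.5", {"web": "web:1.5"})
revert_last(history)

for entry in history:
    print(entry["message"], "->", entry["state"])
```

In a GitOps pipeline the revert is just another commit, so the same automation that rolled the change out also rolls it back.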

While Git is the source of truth for the desired state of the system, Observability provides a benchmark for the actual production state of the running system. In GitOps we take advantage of both to manage our applications and infrastructure. When combining GitOps workflows with real-time observability, development teams can make crucial decisions before they deploy any new features.

In our experience, speed increases even further when dev and ops teams are confident in their system and trust the underlying mechanisms. 

Explore some of our customer stories here.


But what about persistent data and other stateful services?

Almost every team has services that must deal with persistent data, which adds a special set of challenges. Stateful applications such as databases typically reside outside of your cluster and are accessed over the network.

In the past we partnered with third-party vendors like Portworx to show that you can manage stateful and stateless services with GitOps. Portworx allows you to run stateful services such as Cassandra and Elasticsearch on the same stack, next to your stateless applications. Running stateful and stateless services on the same stack can improve performance, speed and data locality.

Combining computing power and storage also allows larger-scale deployments to thousands of nodes, which in turn helps teams achieve high availability for their applications. In the case of Portworx, when a node or a pod fails, Portworx works with Kubernetes, taking advantage of its scheduling ability to ensure that your data is always available within a cluster, across clusters or even across availability zones.

Watch a demonstration of how to recover a stateful application like a MySQL database during a rolling deployment with Weave Kubernetes Platform, Weave Cloud and Portworx.


Have questions on how to manage disaster recovery for Kubernetes?

The Weaveworks team can help you navigate the vast landscape of cloud native technologies and solutions. Together we can create a cloud native blueprint that fits your unique business needs, for example how to manage disaster recovery, high availability and GRC, or how to build a self-service platform.

Contact us for a demo of the Enterprise Kubernetes Platform or check out our Quickstart program to help you get up and running with Kubernetes.