This blog follows on from Part 1 - Distributed Systems, Disaster Recovery and GitOps

A core capability of any distributed system, whether it is distributed for capacity or recovery reasons, is that the initial system can be reproduced. More often than not, large organizations document changes in a Word document and then apply them manually to each and every environment. While this has served as a way to propagate changes to core systems for decades, it is not a reliable way to reproduce a system in today’s distributed world.

To boost reliability and reduce overhead, infrastructure teams have always strived to use some level of automation to simplify the creation and maintenance of the environments that applications run in. With the current DevOps and cloud native revolution, there is both an increase in available tools and pressure from senior management to adopt better automation practices throughout all aspects of the technology stack.

From Manual Ops to GitOps

With the introduction of DevOps principles and cloud native technologies into enterprise infrastructure, developers can take advantage of the scalability that modern cloud infrastructure has to offer. A declarative infrastructure and an API allow developers and cluster operators to create, destroy and, ideally, rebuild all manner of resources within minutes instead of days.

From compute nodes to storage volumes to messaging endpoints, a new paradigm has emerged that takes declarative configuration and manages it in Git alongside your application code. What were traditionally written instructions and a few shell scripts are now replaced with YAML and other declarative templates that define all of the infrastructure, add-ons and tools an application developer needs for developing and running applications on Kubernetes without the platform team being directly involved.

Cluster platform configuration with GitOps

GitOps takes DevOps’ guiding principles into account and extends the simple concepts behind Infrastructure as Code.

GitOps is a pragmatic approach for immutable infrastructure and declarative systems such as Kubernetes, and it keeps infrastructure config versioned, backed up and reproducible from Git. It allows operational teams to manage and compare the current running state of both infrastructure and applications, and to test, deploy or roll back to a desired state with a Git commit. The built-in audit trail of Git (or any other version control system) improves compliance and reliability in the production system. Because most developers are already familiar with Git, operations tasks on Kubernetes become simpler for them, and the automated continuous deployment portion also increases the productivity of development teams.

An important aspect of GitOps that differs from “just keeping everything in Git” is the notion of a desired state that is constantly compared with the running state, with an alert raised whenever the two drift apart. If there is drift from the desired state, the system itself should be able to correct it automatically, effectively creating a feedback and control loop. This is an especially important concept to understand when considering mean time to recovery with GitOps.
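The feedback and control loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular GitOps tool is implemented: the two dictionaries stand in for the state declared in Git and the state observed in a cluster, and the resource names, fields and the `detect_drift` and `reconcile` functions are all hypothetical.

```python
import copy
from dataclasses import dataclass


@dataclass(frozen=True)
class Drift:
    name: str
    desired: object  # None means the resource should not exist
    running: object  # None means the resource is missing


def detect_drift(desired: dict, running: dict) -> list:
    """Compare desired and running state and report every difference."""
    drifts = []
    for name in sorted(desired.keys() | running.keys()):
        want = desired.get(name)
        have = running.get(name)
        if want != have:
            drifts.append(Drift(name, want, have))
    return drifts


def reconcile(desired: dict, running: dict) -> list:
    """Correct the running state in place and return what had drifted."""
    drifts = detect_drift(desired, running)
    for drift in drifts:
        if drift.desired is None:
            # Resource exists but is not declared: remove it.
            del running[drift.name]
        else:
            # Resource is missing or differs: apply the declared version.
            running[drift.name] = copy.deepcopy(drift.desired)
    return drifts


# Hypothetical state: what Git declares versus what is actually running.
desired_state = {
    "web": {"image": "web:1.4", "replicas": 3},
    "queue": {"image": "queue:2.0", "replicas": 1},
}
running_state = {
    "web": {"image": "web:1.4", "replicas": 2},
    "debug-pod": {"image": "busybox", "replicas": 1},
}

for drift in reconcile(desired_state, running_state):
    # In a real system this is where the drift alert would be raised.
    print(f"drift in {drift.name}: desired={drift.desired} running={drift.running}")
```

A real operator would run this comparison continuously against a live API rather than once against in-memory dictionaries, but the shape is the same: observe, compare, alert, correct.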


What’s Your Mean Time to Recovery?

GitOps can reproduce your entire platform straight from Git and have it up and running within minutes. But if you haven't considered how persistent data is reloaded into your applications, then your disaster recovery plan will be flawed. There are a couple of ways to handle data backup and restoration, all of which can be automated and built into your GitOps pipelines.

Every modern cloud infrastructure platform supports sidecar injection of persistent data into individual containers, or directly through to virtual servers, so that the data persists even if an individual instance of an application or an entire node goes offline.

When running Kubernetes, some organizations want to rely on infrastructure tier storage solutions such as SAN (storage area networks), NAS (network attached storage), and even SDS (software defined storage) to store and replicate data between nodes and between remote sites. This approach works great for distributing the data needed for traditional systems. Most cloud-native applications can support it, but it rarely allows for multiple active clusters without a single point of failure that they all share.

Steve Wade discusses how he took Mettle’s MTTR from days to minutes: DevOps Metrics - Success Follows Failure.
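If you want to track the metric this section is named after, mean time to recovery is simply the average time between a failure and the restoration of service. The sketch below shows the arithmetic; the `mean_time_to_recovery` function and the incident timestamps are made up purely for illustration and do not come from any real system.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average the (failed_at, restored_at) durations of a list of incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (restored_at - failed_at for failed_at, restored_at in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents: (time of failure, time service was restored).
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 12)),
    (datetime(2024, 2, 17, 22, 30), datetime(2024, 2, 17, 22, 48)),
]

mttr = mean_time_to_recovery(incidents)
print(f"MTTR: {mttr}")
```

Whatever the numbers look like for your platform, the recovery window should include the time needed to restore persistent data, not only the time needed to rebuild the platform from Git.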

NoSQL Databases

The latest trend in databases, spearheaded by the NoSQL database vendors in the last decade, has been for database software to handle replication and data synchronization across multiple instances, which can be in the same or different sites. When distributed systems, including those at disaster recovery sites, connect to the production database clusters through the automation built into an application’s GitOps workflow, an application can easily spin up a whole new site and be processing real transactions with real data. This avoids the massive backup and recovery scenarios (and the associated downtime) that everyone dreads when a traditional business system needs recovery.

Read more on GitOps and databases in A GitOps RDS Migration Story.


By leveraging GitOps and a modern cloud-native-friendly data platform, it has become possible for even the smallest organizations to have real and reliable disaster recovery plans. With the same technologies, disaster recovery sites no longer need to be cold standby sites. They can either be provisioned as required to reduce costs, or be part of the online services that actively process real transactions, providing instant failover in the event of a disaster that affects a single site. Want to learn more about how GitOps can help? Take a look at our GitOps Discover, Design and Discovery Package for Kubernetes.