AWS Container Days - Managing EKS at Scale with GitOps
At a recent AWS Container Days event, Alexis Richardson, CEO of Weaveworks, delivered a keynote entitled "Managing EKS at Scale with GitOps". Also speaking that day on GitOps, its history and its evolution were Bob Wise, AWS General Manager of Kubernetes, and Pratik Wadher, VP of Product Development at Intuit.
The entire day focused on GitOps, including a detailed discussion of the recently announced merger between the two popular projects Flux and Argo to create the GitOps Engine. You can read more about the Flux and Argo merger in this blog post. Alexis then took the stage and spoke about the past, present and future of GitOps.
The Genesis of GitOps
“GitOps is our name for how we use developer tooling to drive operations...Git is a part of every developer’s toolkit. Using the practices outlined in this post, our developers operate Kubernetes via Git. We manage and monitor all of our applications and the whole ‘cloud native stack’ using GitOps.” -- excerpt from original blog post - GitOps Operations by Pull Request.
GitOps builds on and extends concepts we learned from infrastructure as code, and roughly 90% of what we discuss when it comes to GitOps is not new. However, there are some important differences between traditional infrastructure-as-code tools and what we call GitOps. With GitOps, change is no longer orchestrated from outside the system. Instead, with tools like Kubernetes, Docker containers and Argo Flux, changes are driven from inside the system. Making changes to the cluster in this fashion not only increases security, it also integrates your processes natively and allows you to automate pipelines, which increases reliability. In the end, this level of automation and integration means that you can scale more easily.
The birth of GitOps - from disaster recovery to a new methodology
Weaveworks has been operating Kubernetes since it first came out about five years ago. We use it for our SaaS product Weave Cloud.
GitOps came about because a fat-fingered engineer accidentally hit the wrong button and wiped out our entire system. After this happened, the team went into crisis mode to recover from the meltdown. Alexis watched as the team recovered the entire system in about 45 minutes. These days we can do that much more quickly, in less than five minutes.
How did Weaveworks recover from disaster in about 45 minutes?
The team was able to recover because of the following things we had in place:
- Declarative infrastructure - Docker, Kubernetes, Terraform and other immutable objects.
- Entire system versioned in Git - Everything that could be version controlled is kept in Git. This includes code, configuration manifests, dashboards, etc. All of it is kept in Git with a full audit trail.
Since everything is kept in Git, we can roll out and reproduce a system and be up and running again within minutes. In addition, any changes to the application or to the cluster can be deployed as atomic pull requests and automatically converged onto the cluster whenever a difference is detected from the source of truth kept in Git.
The convergence of the cluster and its applications is handled by Flux. Together with Argo, we will have a unified, standard GitOps tool and methodology that can manage atomic updates and that everyone can use.
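The converge-on-drift idea described above can be illustrated with a minimal sketch of a reconciliation loop. This is not the actual Flux implementation; the helper functions and the dict-based "cluster" are hypothetical stand-ins for Git manifests and cluster state.

```python
# Minimal sketch of a GitOps reconciliation loop. The repo and cluster
# here are plain dicts standing in for Git manifests and live state;
# all names are illustrative, not Flux's actual API.

def desired_state(repo):
    """The desired state is whatever is committed to Git."""
    return repo["manifests"]

def actual_state(cluster):
    """The actual state is what is currently running in the cluster."""
    return cluster["running"]

def diff(desired, actual):
    """Return the resources that differ from the source of truth."""
    return {name: spec for name, spec in desired.items()
            if actual.get(name) != spec}

def reconcile(repo, cluster):
    """Converge the cluster toward what is declared in Git."""
    drift = diff(desired_state(repo), actual_state(cluster))
    for name, spec in drift.items():
        cluster["running"][name] = spec  # apply the declared spec
    return drift  # report what changed, e.g. for alerting

# Git declares image 2.1, but the cluster is still running 2.0.
repo = {"manifests": {"podinfo": {"image": "podinfo:2.1"}}}
cluster = {"running": {"podinfo": {"image": "podinfo:2.0"}}}

changed = reconcile(repo, cluster)
```

After one pass, the cluster matches Git, and a second pass detects no drift; the key design point is that the operator pulls the desired state from inside the cluster rather than having a CI job push changes in from outside.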
Future of Argo and Flux
As mentioned, the merging of the Argo and Flux projects will provide the community with a standard set of tools. Other features planned for the merged project include progressive delivery for blue/green and canary deployments, as well as policy and compliance.
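The core mechanic of a canary deployment can be sketched in a few lines: shift a growing share of traffic to the new version, and roll back if its error rate crosses a threshold. The function below is a toy model for illustration, not the API of any actual progressive delivery tool; the step schedule and threshold are assumed values.

```python
# Hypothetical sketch of a canary rollout: shift traffic to the new
# version in steps, rolling back if its error rate exceeds a threshold.

def canary_rollout(error_rate, steps=(10, 25, 50, 100), max_error=0.01):
    """Return the final canary traffic weight: 100 if fully promoted,
    0 if the rollout was aborted and traffic returned to stable."""
    weight = 0
    for step in steps:
        weight = step  # route this percentage of traffic to the canary
        if error_rate(weight) > max_error:
            return 0  # roll back: all traffic goes to the stable version
    return weight

# A healthy release survives every step and is fully promoted.
healthy = canary_rollout(lambda w: 0.001)

# A release that starts failing at 50% traffic is rolled back.
faulty = canary_rollout(lambda w: 0.05 if w >= 50 else 0.0)
```

In a real GitOps setup the error-rate check would query live metrics, and the weight changes themselves would be committed or applied declaratively rather than mutated imperatively.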
Why progressive deployments?
As an example, if you have 2,000 microservices with more than 3,000 A/B tests running on different aspects across your team of 700 developers, you will need a robust, cloud-native-aware progressive delivery solution. That may sound far-fetched today, but it's what people will be doing very soon with these technologies, says Alexis. At some point you will want to experiment, test and tweak in order to optimize your application, quickly meet your customers' needs and ultimately increase your ROI.
GitOps for cluster management
The next step in the evolution of GitOps is complete control over how your clusters are built and rolled out in your organization. As most of you know, developing for Kubernetes involves more than just the cluster. There is a whole host of other technologies around Kubernetes that you need before you can start developing your application, debugging it and monitoring it once it's running in production. In addition, you may have technical requirements specific to your application; perhaps, for example, you need to take advantage of machine learning algorithms.
Below is a developer workflow using GitHub Actions that runs tests, integrates code, builds a container and, if everything passes, pushes it to Amazon's container registry. At this point the GitOps operator Flux sends an alert indicating that the cluster differs from what's in Git.
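A workflow along these lines might look like the following sketch. The repository layout, image name, registry account and region are illustrative assumptions, not details from the talk:

```yaml
# Hypothetical CI workflow: test, build, and push an image to Amazon ECR.
# Names, versions and the registry URL are illustrative.
name: ci
on:
  push:
    branches: [main]
jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run tests
        run: make test
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Log in to Amazon ECR
        uses: aws-actions/amazon-ecr-login@v1
      - name: Build and push image
        env:
          ECR_REGISTRY: 123456789012.dkr.ecr.us-east-1.amazonaws.com
        run: |
          docker build -t $ECR_REGISTRY/my-app:$GITHUB_SHA .
          docker push $ECR_REGISTRY/my-app:$GITHUB_SHA
```

Note that CI stops at the registry: the manifests in Git still reference the old image tag, which is exactly the drift the GitOps operator detects and surfaces.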
Developer ready clusters straight from Git
What if you could start using a stack that's been pre-configured for what you are developing, and that is developer-ready after running a simple `git clone` on the command line?
Maybe you need a machine learning stack, or a mobile stack that lets you create a mobile app. One of the biggest problems with Kubernetes is its flexibility. Normally you wouldn't consider choice a problem, and having a large number of projects to choose from is fantastic, but the trouble comes when you need to configure all of these different apps to work with one another. Not only can integrating them be a challenge, but creating clusters across environments and teams in a consistent, repeatable way is time-consuming and error-prone.
For more on how you can manage entire cluster platforms in Git with GitOps, watch the presentation below: