In a prior post on machine learning and GitOps, we described how you can use an MLOps profile to run a fully configured Kubeflow pipeline for training machine learning models on either Amazon’s managed Kubernetes service, EKS, or on clusters created with Firekube.
In this post, we’ll take that same concept a couple of steps further and show how you can automate and manage the entire machine learning pipeline from training through to deployment with GitOps using GitHub Actions. The tutorial makes use of the Kubeflow Automated PipeLines Engine (or KALE), and it also introduces a novel way to version trained models that can be picked up by Weave Flagger for progressive deployments.
This post is divided into the following sections:
- Triggering a GitHub Action from a GitHub repository that compiles a Jupyter notebook into a Kubeflow pipeline on EKS.
- A description of what a “model container” is and how it can be used to version machine learning models.
- How to deliver trained models progressively with Weave Flagger.
- Getting started quickly with an opinionated cluster using a special Firekube distribution for MLOps that can be cloned from a GitHub repo.
GitHub Actions for Compiling and Running Kubeflow Pipelines
GitHub Actions makes it easy to automate CI/CD pipelines all from within GitHub without having to context switch to another application.
In this example, GitHub Actions is used to check out a Jupyter notebook from a GitHub repository and compile it into a Kubeflow pipeline that trains the model on an EKS cluster. If you’re a data scientist, you are probably already using Jupyter notebooks. For those of you outside the machine learning field, you can think of a Jupyter notebook as an IDE with REPL capability. A Jupyter notebook is usually the first tool ML practitioners reach for when they start preparing or training their machine learning models.
A GitHub Action for managing Kubeflow pipelines
The GitHub Action for EKSctl / KALE is a custom action that connects a GitHub repository containing our Jupyter notebooks to an EKS cluster. It bundles a standard EKSctl binary with the Kubeflow Automated PipeLines Engine (KALE). The Action is set to be triggered whenever a change to the Jupyter notebook is pushed to the repository.
The manifest for it is as follows:
The Action starts by telling EKSctl to retrieve the kubeconfig file for the cluster. It then sends the Jupyter notebook to the cluster, where it is compiled and then run as an ML pipeline in Kubeflow. After the model finishes running through the pipeline, the result is stored in S3-compatible storage inside Kubeflow.
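GitHub Actions workflows can express the "run only when a notebook changes" condition with a path filter on the push trigger. As a rough illustration of the same check in plain shell, the sketch below filters a list of changed paths down to notebooks. The helper name `changed_notebooks` is hypothetical and is not part of the Action described here.

```shell
# Hypothetical helper: read changed file paths on stdin and print only
# the Jupyter notebooks. The exit status is non-zero when none are found,
# so a caller can skip the pipeline run in that case.
changed_notebooks() {
  grep -E '\.ipynb$'
}
```

One possible use is `git diff --name-only HEAD~1 HEAD | changed_notebooks`, which lists the notebooks touched by the latest commit.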
Versioning Trained Models with Containers
A problem we’ve run into is the versioning of trained models. As mentioned, trained models are typically stored in an S3 bucket. When we need to deploy a model, it is retrieved from S3 and served directly by an inference service. However, delivering models directly from S3 makes it difficult to control which version of a model is being served.
So, why not use containers as the delivery medium for models?
Fortunately, Kubernetes supports the concept of init containers inside a Pod. When we serve different machine learning models, we change only the model, not the serving program. This means that we can pack our models into simple model containers and run them as init containers of the serving program. We also include a simple copy script inside our model containers that copies the model file from the init container into a shared volume so that it can be served by the model serving program.
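The copy script itself is not shown in this post, so the following is only a minimal sketch of what such a script might look like. The function name `copy_model` and the default paths `/model` and `/mnt/models` are assumptions chosen for illustration.

```shell
# Hypothetical copy step for a model container running as an init container.
# Both default paths are assumed: /model is where the image is imagined to
# store the model, /mnt/models is where the shared volume would be mounted.
copy_model() {
  src="${1:-/model}"
  dest="${2:-/mnt/models}"
  if [ ! -d "$src" ]; then
    echo "model directory not found: $src" >&2
    return 1
  fi
  mkdir -p "$dest" || return 1
  # Copy the directory contents (including hidden files) into the volume.
  cp -R "$src"/. "$dest"/
}
```

In a real model container, the entrypoint would end by calling `copy_model`, after which the init container exits and the serving container starts with the model already in the volume.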
Versioning models in containers allows you to employ more advanced deployment strategies and other progressive delivery techniques. In addition, when model-serving Pods are scaled out, we save a lot of download bandwidth because the model container images are already cached on the nodes that have pulled them.
Model containers are specified under initContainers in the Pod template of the Deployment manifest (YAML).
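To make the layout concrete, here is a sketch of a shell function that prints such a Deployment manifest. Everything specific in it is an assumption made for illustration: the function name `render_model_deployment`, the resource and volume names, the mount path, and the `example.com` image references.

```shell
# Hypothetical generator for a Deployment whose model is delivered by an
# init container. Usage: render_model_deployment <model-image> <model-tag>
render_model_deployment() {
  model_image="$1"   # assumed image name, e.g. example.com/models/mnist
  model_tag="$2"     # model version, e.g. v1
  cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      initContainers:
        - name: model
          image: ${model_image}:${model_tag}
          volumeMounts:
            - name: model-volume
              mountPath: /mnt/models
      containers:
        - name: server
          image: example.com/serving/server:latest
          volumeMounts:
            - name: model-volume
              mountPath: /mnt/models
              readOnly: true
      volumes:
        - name: model-volume
          emptyDir: {}
EOF
}
```

The init container and the serving container share an `emptyDir` volume, so only the `image` tag under `initContainers` changes from one model version to the next.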
Progressive Delivery of Machine Learning Models with Weave Flagger
Flagger is a progressive delivery engine for Kubernetes. It provides a way to update our application with a canary deployment model (and other techniques). Another good thing about Flagger is that it’s very generic. Flagger scans every field inside a deployment manifest to detect any version changes. With this functionality, we can also use Flagger to detect changes in the init-container fields that define our model containers. This is all we need to serve machine learning models progressively with Flagger.
We can simply change the version of our model containers from v1 to v2, then commit and push to watch the rollout progress. This means that machine learning models are delivered progressively with Flagger and are also fully managed via GitOps. Check out the video below to see the traffic flow during the progressive delivery process.
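The version bump described above amounts to editing one image tag in the manifest. A minimal sketch, assuming a hypothetical helper named `bump_model_tag` and an image name that contains no characters special to `sed` other than dots:

```shell
# Hypothetical helper: rewrite the tag of one image in a manifest file.
# Usage: bump_model_tag <manifest-file> <image-name> <new-tag>
bump_model_tag() {
  manifest="$1"
  image="$2"
  new_tag="$3"
  tmp="${manifest}.tmp"
  # Replace whatever tag follows "image: <image-name>:" with the new tag.
  # A temporary file is used instead of "sed -i" for portability.
  sed "s|\(image: *${image}\):[^[:space:]]*|\1:${new_tag}|" "$manifest" > "$tmp" \
    && mv "$tmp" "$manifest"
}
```

After a call such as `bump_model_tag deployment.yaml example.com/models/mnist v2` (file and image names are placeholders), the change would be committed and pushed with `git commit` and `git push`, leaving the rest to GitOps and Flagger.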
Try WKS Firekube for MLOps
In this post, we showed how you can deliver machine learning models progressively with GitOps, GitHub Actions, Flagger, EKS, Firekube and Kubeflow.
Get started with your own MLOps cluster with this special Firekube distribution for MLOps that contains a minimized version of Kubeflow. To get a cluster up and running, just follow these steps:
- Clone WKS Firekube for MLOps from WKS-Firekube-MLOps
$ git clone https://github.com/chanwit/wks-firekube-mlops
- Change into the directory
$ cd wks-firekube-mlops
- Run the start.sh script to bootstrap Firekube together with Kubeflow