In a recent Weave Online User Group (WOUG), David Aronchick (@aronchick), Head of Open Source ML Strategy at Microsoft, spoke about how you can bring machine learning to production with machine learning operations, or MLOps.
Microsoft and Machine Learning
Microsoft Azure offers a number of services for the machine learning community, including open source datasets and models. In addition, the company offers an entire platform for machine learning, including rich data services that work both on-premises and in the cloud.
Microsoft is a leader in machine learning. It has advanced object recognition in machine vision, meeting or exceeding human parity through deep residual learning for image recognition (ResNets), and has made major advances in speech recognition, machine translation, and reading comprehension (natural language processing), some of which have been open sourced. While Microsoft gives back to the community through open source, machine learning is, and always has been, an important part of the company’s product strategy.
Machine learning is difficult to get right
One of the biggest reasons that machine learning is difficult is building the model and being able to iterate on it until it is correct. In this sense, the model itself is a small part of the entire machine learning process, which involves several steps before you even reach the stage where a model can be deployed.
Much like developers who really only care about their code and the resulting application features, most data scientists only care about the model in the end. Unfortunately, the reality is that you need to make your way through a number of steps before you get to the model. Much of that work involves building efficient and reusable pipelines so that you can iterate on your design and bring the model to production much more quickly.
Data scientists and SREs have competing interests
David then describes a common conflict in organizations that incorporate machine learning into their applications. Data scientists typically want to iterate quickly on their models to test how they work in production. They also want to use the frameworks they understand, and they want unlimited scale. On the other side of the spectrum are the machine learning engineers, who want to reduce complexity and have their teams reuse tools and platforms. In addition, ML engineers not only need to meet compliance regulations, but they also need to keep the application running reliably in production.
How do you bring these teams together?
GitOps brings these two teams together. Code is iterated on, pushed, and merged to git, which sets off a series of automated tasks running all the way into production. Git maintains the source of truth for what is running in production. This allows for a continuous cycle: any mismatch between what is committed to git and what is running on the cluster triggers an alert. The new code can then be rolled out to the cluster, or rolled back, to restore equilibrium. In the end, this continuous cycle provides a method of working that allows for fast innovation.
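The reconcile loop described above can be sketched in a few lines. This is a minimal illustration of the idea, not Flux's actual implementation; the function names and dictionary shapes are invented for the example:

```python
# Minimal sketch of a GitOps reconciliation loop: git is the source of
# truth, drift between git and the cluster is detected, and the cluster
# is rolled forward (or back) to match what is committed.
# All names and structures here are illustrative, not a real Flux API.

def desired_state(git_repo: dict) -> dict:
    """The manifests committed to git are the source of truth."""
    return git_repo["manifests"]

def detect_drift(git_repo: dict, cluster: dict) -> dict:
    """Compare what git says should run against what actually runs."""
    return {name: spec
            for name, spec in desired_state(git_repo).items()
            if cluster.get(name) != spec}

def reconcile(git_repo: dict, cluster: dict) -> dict:
    """Apply the committed version of every drifted workload."""
    for name, spec in detect_drift(git_repo, cluster).items():
        cluster[name] = spec
    return cluster

# A new model image is merged to git while the cluster still runs v1.
repo = {"manifests": {"model-server": {"image": "model:v2"}}}
cluster = {"model-server": {"image": "model:v1"}}

drift = detect_drift(repo, cluster)  # mismatch found -> alert would fire
reconcile(repo, cluster)             # cluster converges on git's state
```

Rolling back works the same way: reverting the commit in git changes the desired state, and the next reconcile pass converges the cluster on the older version.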
The same concepts also apply to machine learning, hence the term MLOps. With GitOps, data scientists can experiment by integrating new data and iterating on the algorithm as it is developed and trained, before it is pushed to staging or production by the ML engineer. Since MLOps takes advantage of a typical development pipeline, it also integrates data scientists with the rest of the development team instead of leaving them off on their own island.
What are the benefits of MLOps?
GitOps benefits machine learning in many of the same ways as regular feature deployments:
Automation and observability
Code drives the generation of models and their deployments. Since everything is driven through a common pipeline, models are reproducible and verifiable. All artifacts used for model training are also tagged and audited, offering better recoverability should a disaster occur.
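Tagging training artifacts might look something like the sketch below: each trained model gets an audit record tying it to the exact code commit, data, and parameters that produced it. The field names and `tag_model` helper are illustrative assumptions, not part of any particular platform:

```python
# Hedged sketch: record the inputs that produced a model so training is
# reproducible and auditable. Field names are invented for illustration.
import hashlib

def fingerprint(data: bytes) -> str:
    """Content hash of the training data, so the exact dataset can be
    verified later."""
    return hashlib.sha256(data).hexdigest()

def tag_model(model_name: str, code_commit: str,
              training_data: bytes, hyperparams: dict) -> dict:
    """Build an audit record tying a model to the code, data, and
    parameters used to train it."""
    return {
        "model": model_name,
        "code_commit": code_commit,           # git SHA of the pipeline code
        "data_sha256": fingerprint(training_data),
        "hyperparams": hyperparams,
    }

record = tag_model("churn-model", "a1b2c3d",
                   b"customer_id,churned\n42,1\n",
                   {"lr": 0.01, "epochs": 10})
```

With a record like this committed alongside the model, any deployed version can be traced back to, and rebuilt from, its exact inputs.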
GitOps gives you best practices around quality control, allowing you to compare models both online and offline. It also minimizes bias and enables explainability.
Reproducibility and Auditability
Because everything is being driven through a common pipeline, you’re able to do a lot more. For example, you can make comparisons and feed those results back in order to improve and iterate on the model.
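One concrete form of that feedback loop is an offline comparison gate: evaluate a candidate model against the current one on the same holdout set, and only promote it if it scores better. This is a minimal sketch of the pattern, with invented function names and toy models standing in for real ones:

```python
# Illustrative sketch of an offline model-comparison step in a shared
# pipeline: the candidate is promoted only if it beats the current
# model on the same holdout data. Names here are assumptions.

def accuracy(model, examples):
    """Fraction of (input, label) pairs the model predicts correctly."""
    correct = sum(1 for x, y in examples if model(x) == y)
    return correct / len(examples)

def promote_if_better(current, candidate, holdout):
    """Keep whichever model scores higher on the shared holdout set."""
    if accuracy(candidate, holdout) > accuracy(current, holdout):
        return candidate
    return current

# Toy stand-ins for trained models: callables mapping input -> label.
holdout = [(0, 0), (1, 1), (2, 0), (3, 1)]
current = lambda x: 0        # always predicts 0: 50% on this holdout
candidate = lambda x: x % 2  # matches every holdout label: 100%

best = promote_if_better(current, candidate, holdout)
```

Because the comparison runs inside the same pipeline as training, its results can be logged and fed back into the next iteration of the model.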
Available ML Platforms
If you are a large company with hundreds of data scientists, you might be using one of the hosted and integrated machine learning platforms:
But if you are not at a big company, you will be faced with having to build your own platform using the available open source projects.
Building your own MLOps Platform
Machine learning framework and training pipeline
On one side, you choose a framework and pipeline for machine learning. There are many open source frameworks out there, such as Kubeflow or TensorFlow. Alongside the framework, you will need a training pipeline. Training pipelines are offered as managed services by the major cloud providers: Azure Machine Learning, SageMaker from AWS, or Kubeflow on GCP.
Source control system
Next you’ll need a source control system. This could be GitHub, GitLab, Bitbucket, or any other version control system.
The final step is to add the continuous integration and continuous deployment portion of the pipeline. There are many tools to choose from for CI: GitHub Actions, Jenkins, and others. For the GitOps-style continuous deployment portion of the pipeline, use Weave Flux.
Find out more about building pipelines for machine learning in the online talk: