Make Machine Learning on Kubernetes portable and observable with Kubeflow and Weave Cloud
This step-by-step tutorial shows how to set up Kubeflow, a tool that simplifies setting up a portable Machine Learning stack, together with Weave Cloud on Google Cloud Platform. Kubeflow users can then use Weave Cloud to observe and monitor the stack, including metrics for resource management.

Kubeflow
One of the big announcements from KubeCon + CloudNativeCon in early December 2017 was Kubeflow, an open-source project “dedicated to making Machine Learning (ML) on Kubernetes easy, portable and scalable.” Kubeflow targets users who want portable, simpler ML stacks with more control, or who want to run their Kubernetes deployments on different platforms, on-premises, and so on. For instance, one vision for Kubeflow is that different teams (data scientists, developers, IT, etc.) can share or hand off systems without worrying about who still needs to manage the underlying infrastructure. Kubeflow is intended to leverage Kubernetes’ ability to deploy on diverse infrastructure, to deploy and manage loosely coupled microservices, and to scale based on demand. For people using a single-cloud, hosted ML service today, Kubeflow may offer an alternative that meets different needs.
Kubeflow and Weave Cloud
Among Weaveworks’ users today, companies such as Qordoba and Seldon already see the value of using Kubernetes for Machine Learning. Stay tuned for future Weaveworks blog posts that will cover potential use cases from these and other users around Kubeflow and Weave.
Kubeflow users can leverage Weave Cloud to simplify the observability, deployment, and monitoring of Kubeflow running on Kubernetes clusters. In particular, Kubeflow users who follow the GitOps methodology can keep their manifests in a repo as a single source of truth, so the vision of sharing and handing off Kubeflow systems can be operationalized with git push and Weave Cloud.
Weave Cloud’s monitoring capabilities also help with a variety of metrics, including resource management, which can be critical for Kubeflow. Weave Cloud’s UI offers quick, interactive views of CPU and memory usage from the high-level overview.
To drill down to resource monitoring for processes, containers, pods, and hosts, as well as by service and namespace, see below for further details.
Getting started with Kubeflow and Weave Cloud
- Kubeflow on GitHub: Clone the Kubeflow repository into a GitHub repo of your own. (It contains the Kubernetes manifests that install and run Kubeflow on a production cluster.)
- Google Cloud Platform:
  - Set up a cluster in Google Cloud Platform.
    - We created a basic 3-node cluster for this demo.
  - Follow the steps for Kubeflow on Google Kubernetes Engine.
  - Follow the Quick Start steps (which use ks apply). This recursively installs all of the manifests that appear in components.
  - Optional: If you want to run code to train TensorFlow convolutional neural network (TF CNN) models, you can run those jobs with kubectl as well.
- Weave Cloud:
  - Create a free trial account.
  - Follow the set-up steps for Weave Cloud Deploy, Explore, and Monitor. (You will be deploying agents that send metrics to the Weave Cloud UI.)
    - Note: the Weave Cloud Deploy set-up process injects a deploy key into the GitHub repo that you created earlier. (You can’t inject the key into the Google Kubeflow repo, which is why you clone the Kubeflow repo in GitHub.)
  - Click on the Explore button in Weave Cloud to visualize your new Kubeflow cluster in Google Cloud.
  - Weave Cloud also offers monitoring based on hosted, multi-tenant, and scalable Prometheus. With Weave Cloud, you can store Prometheus metrics for months and query them using Prometheus’ powerful query language, PromQL.
  - Weave Cloud’s monitoring capabilities also help with a variety of metrics, including resource management, which can be critical for Kubeflow. To drill down to processes, containers, and hosts, click on the “bar graph” icon.
  - You can drill down to pods as well. (Note the “water level” in each pod, which indicates CPU usage, for example for the tf-job-operator pod.)
  - Weave Cloud’s monitoring also shows resources by service and namespace.
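
Because Weave Cloud’s monitoring is Prometheus-based, the per-namespace and per-pod resource views described above can also be approximated with PromQL. The following is a minimal Python sketch under stated assumptions, not Weave Cloud’s documented client. It builds an instant query for the standard Prometheus HTTP API path /api/v1/query. The base URL, the bearer-token header, the "default" namespace, and the cAdvisor-style metric and label names are all placeholders or assumptions that you would need to adapt to your own instance and cluster.

```python
"""Sketch: build (and optionally run) PromQL instant queries for CPU and
memory usage against a Prometheus-compatible HTTP API."""
import json
import urllib.parse
import urllib.request

# ASSUMPTION: placeholder base URL. Substitute the Prometheus-compatible
# endpoint of your own monitoring instance.
PROM_BASE_URL = "https://prometheus.example.invalid"

# ASSUMPTION: cAdvisor-style metric and label names. Some clusters expose
# "pod_name" instead of "pod"; adjust the labels to match yours.
CPU_BY_NAMESPACE = (
    "sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))"
)
# ASSUMPTION: "default" is a stand-in for the namespace Kubeflow runs in.
MEMORY_BY_POD = (
    'sum by (pod) (container_memory_usage_bytes{namespace="default"})'
)


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL (/api/v1/query) for a PromQL expression."""
    query_string = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def run_query(base_url: str, promql: str, token: str) -> list:
    """Run an instant query and return the list of result series."""
    request = urllib.request.Request(
        build_query_url(base_url, promql),
        # ASSUMPTION: bearer-token auth; your endpoint may require another scheme.
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


if __name__ == "__main__":
    # Only prints the URLs; no network call is made here.
    print(build_query_url(PROM_BASE_URL, CPU_BY_NAMESPACE))
    print(build_query_url(PROM_BASE_URL, MEMORY_BY_POD))
```

Keeping URL construction separate from the network call makes it possible to inspect the exact query before sending it. Calling run_query with a real endpoint and token would return one series per namespace or pod.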
The Kubeflow announcement reached #1 on Hacker News when it was made at KubeCon, and there has been a lot of interest since, with ML companies testing it out.
We hope you found this step-by-step guide helpful. Please join our conversations on Twitter @weaveworks and @kubeflow or share your thoughts on Slack #weaveworks and #kubeflow.