Multi-Cloud Big Data Processing with Flink, Docker Swarm and Weave Plugin
This post shows step by step how to set up a multi-cloud environment for big data processing using Apache Flink, Docker Swarm and the new Weave Net Docker plugin.

This is a guest post from Thalita Vergilio, a Java developer interested in multi-cloud architectures, design patterns, container orchestration and big data processing. She is currently studying for a PhD at Leeds Beckett University.
In this post, I would like to share my experiences of setting up a multi-cloud environment for big data processing using Apache Flink, Docker Swarm and the new Weave Net Docker plugin.
The idea is to approach big data processing from a cloud consumer’s perspective, and to enable distribution and scalability down to service level – or even down to code level. Frameworks such as Flink provide the ability to break down data processing into smaller parallelisable units which are distributed across different nodes for processing. Thanks to the cloud’s elasticity, deploying these nodes to the cloud makes it easy to scale resources up and down. Even better, distributing the nodes across different clouds lowers the risk of vendor lock-in and allows the cloud consumer to pick the most cost-effective way of scaling up their resources.
Specification
- The cloud service model selected for this experiment was IaaS, with Ubuntu 16.04.3 LTS (Xenial) virtual machines commissioned from Microsoft Azure and Google Cloud.
- Each node has 2 CPUs, 7 GB of memory and a 30 GiB disk.
Configuration
- Each node has a public static IP address.
- The following ports are exposed on each node:
- Used by Docker Swarm:
- TCP port 2377 for cluster management communications
- TCP and UDP port 7946 for communication among nodes
- UDP port 4789 for overlay network traffic
- Used by Weave:
- TCP port 6783
- UDP port 6784
- Used by Flink:
- TCP port 6123 (jobmanager.rpc.port)
- TCP port 6124 (blob.server.port)
- TCP port 6125 (query.server.port)
- TCP port 6126 (taskmanager.data.port)
- TCP port 6127 (taskmanager.rpc.port)
- Docker CE is installed on each node.
- The Weave Net Docker plugin is installed on each node.
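The port list above can be opened on each Ubuntu node with ufw, for example. This is only a sketch: depending on your provider, you will also (or instead) need to allow these ports in the cloud-level firewall (an Azure network security group, or a Google Cloud firewall rule).

```shell
# Docker Swarm: cluster management, node gossip, and overlay traffic
sudo ufw allow 2377/tcp
sudo ufw allow 7946/tcp
sudo ufw allow 7946/udp
sudo ufw allow 4789/udp

# Weave Net: control and data traffic
sudo ufw allow 6783/tcp
sudo ufw allow 6784/udp

# Flink: RPC, blob server, query server and task manager ports
sudo ufw allow 6123:6127/tcp

sudo ufw enable
```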
Installation Steps
Set up the nodes
Create virtual machines in two different clouds. I used Microsoft Azure and Google Cloud, followed the Specification above, and created the nodes as follows:
- azure-swarm-manager (for managing the Swarm)
- azure-flink-jobmanager (for managing the parallel processing of data)
- azure-swarm-worker-1 (for processing the data)
- azure-swarm-worker-2 (for processing the data)
- google-cloud-worker-1 (for processing the data and as a multi-cloud proof-of-concept)
Install Docker CE
Docker CE needs to be installed on all nodes.
- Follow the instructions from: https://docs.docker.com/engine/installation/linux/...
- Add sudo permissions to docker user:
- sudo usermod -aG docker $USER
- newgrp docker
Install Weave Net Docker Plugin
The Weave Net Docker plugin needs to be installed on every node to create the multi-cloud network.
- sudo docker plugin install weaveworks/net-plugin:latest_release --grant-all-permissions --disable
- docker plugin set weaveworks/net-plugin:latest_release IPALLOC_RANGE=172.30.0.0/16 (you can pick any range, but it must be the same on all nodes)
- docker plugin enable weaveworks/net-plugin:latest_release
Running Weave Net
When I ran into networking issues, the Weave Slack channel was the place to go. The Weave team advised me to set up Weave Net by following the official installation instructions. This needs to be done on every node.
- sudo curl -L git.io/weave -o /usr/local/bin/weave
- sudo chmod a+x /usr/local/bin/weave
- weave launch
- weave connect ${IP address of manager}
At this point, if you run ‘weave status peers’, you should see all of the nodes networked together.
Create the Swarm
In the Swarm Manager node:
- docker swarm init --advertise-addr ${IP address of manager}
In each Worker node:
- docker swarm join --token ${swarm token} --advertise-addr ${public IP address of node} ${IP address of manager}:2377
The ‘--advertise-addr’ flag is important: without it, each node advertises its private IP address, which is not reachable from the other cloud, so the setup only works within a single cloud.
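Once all workers have joined, you can confirm the swarm membership from the manager (a sketch; the hostnames follow the naming scheme above):

```shell
# On the Swarm manager: list all nodes in the swarm.
# All five nodes should appear with STATUS Ready, and
# azure-swarm-manager should show Leader under MANAGER STATUS.
docker node ls
```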
Create the Network and the Services
I am using a customised Flink image which sets the ports used by the Task Manager to known values. The default behaviour is to assign a free port at runtime, which didn’t work in this case, since the ports need to be explicitly exposed.
The image I created is available on Docker Hub: https://hub.docker.com/r/tvergilio/docker-flink/
Create the services in the Swarm Manager node:
- docker network create --driver=weaveworks/net-plugin:latest_release mynetwork
- docker service create --name jobmanager --env JOB_MANAGER_RPC_ADDRESS=jobmanager -p 8081:8081 -p 6123:6123 -p 6124:6124 -p 6125:6125 --network=mynetwork --constraint 'node.hostname == azure-flink-jobmanager' tvergilio/docker-flink:latest jobmanager
- docker service create --name taskmanager --env JOB_MANAGER_RPC_ADDRESS=jobmanager -p 6121:6121 -p 6122:6122 --network=mynetwork --constraint 'node.hostname != azure-flink-jobmanager' --constraint 'node.hostname != azure-swarm-manager' tvergilio/docker-flink:latest taskmanager
Scale the ‘taskmanager’ service so that one instance is deployed to each worker node:
- docker service scale taskmanager=3
You should now be able to inspect your Task Manager services.
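For example, from the Swarm manager, the running services can be inspected with:

```shell
# REPLICAS should show 3/3 for taskmanager and 1/1 for jobmanager
docker service ls

# See which node each taskmanager replica was scheduled on
docker service ps taskmanager

# Tail the jobmanager logs to check the Task Managers registered
docker service logs jobmanager
```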
Run a Big Data Processing Pipeline in Parallel
I have used Apache Beam to create a simple Big Data processing pipeline. Apache Beam adds another layer of separation to the architecture, as it decouples the processing code from the Big Data framework used to run it. In this case, Flink is being used to run the Beam pipeline, but it could as easily have been Spark or Google Dataflow.
I will leave the details of the Beam implementation for another post. To put it briefly, the processing code I wrote captures real-time data posted to a Kafka topic, divides it into sliding windows of 5 minutes, starting every minute, then processes the data window by window. It outputs the results to a different Kafka topic.
- Use a browser to log onto the Flink UI on ${IP address of manager}:8081.
- Upload the Beam pipeline JAR file.
- Watch it run and parallelise the work beautifully across clouds.
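As an alternative to the web UI, the job can also be submitted from the command line inside the jobmanager container. This is a sketch: the container name, JAR file and main class below are hypothetical placeholders for your own Beam build.

```shell
# Find the jobmanager container ID on the node where it runs
JM=$(docker ps --filter name=jobmanager --format '{{.ID}}')

# Copy the pipeline fat JAR into the container, then submit it.
# "beam-pipeline.jar" and "com.example.KafkaWindowingPipeline"
# are placeholders for your own artifact and entry point.
docker cp beam-pipeline.jar "$JM":/tmp/
docker exec "$JM" flink run \
    -c com.example.KafkaWindowingPipeline /tmp/beam-pipeline.jar \
    --runner=FlinkRunner
```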
Weave Cloud
While working on this project, I was given a free trial of the Weave Cloud service offered by Weave. I must say, once you have all your nodes set up and configured, it is a very nice visual tool for monitoring and managing your swarm.
You can see all your nodes connected and talking to each other, as well as which services are running on your stack and which containers are running on each node.
The Containers view was particularly helpful once I started scaling up my services.
Acknowledgements
I would like to sincerely thank the Weaveworks team for all their help and support with my countless networking issues during the development of this project. You guys are awesome, and your plugin is just what was missing from Docker Swarm.
This work made use of the Open Science Data Cloud (OSDC) which is an Open Commons Consortium (OCC)-sponsored project.
Cloud computing resources were provided by a Microsoft Azure for Research award.