I just created a Cassandra cluster that spans 3 different network domains, by using 2 simple shell commands. How cool is that?
Today we welcome Yaron Rosenbaum as our first guest on the Weave blog, with a really cool use case!
— alexis
Over to you, Yaron..
I needed to create a Cassandra cluster that would span two different cloud providers. There are endless ways to achieve this (simplified) end result, which is definitely not new or unique and has been done numerous times before. The goal of this post is to explain the philosophy behind my proposed approach of using Docker and an SDN (Software Defined Network), and its benefits. Although this post explicitly talks about Cassandra, it is applicable to virtually any software stack that needs to network with its peers across a ‘network chasm’.
“All problems in computer science can be solved by another level of indirection” (David Wheeler)
Why I like to use Docker – enforcing boundaries
Drop a pebble in the water:
just a splash, and it is gone;
But there’s half-a-hundred ripples
Circling on and on and on,
Spreading, spreading from the center,
flowing on out to the sea.
And there is no way of telling
where the end is going to be.
…
Drop a pebble in the water:
in a minute you forget,
But there’s little waves a-flowing,
and there’s ripples circling yet,
And those little waves a-flowing
to a great big wave have grown;
You’ve disturbed a mighty river
just by dropping in a stone.
…
~By James W. Foley~
We all know that Code, Runtime stack, and Configuration are three different things that are interrelated: We create an application (code), we build it (‘Binary’), and we deploy it on a runtime stack (Tomcat, Play) that has been configured in a specific way (JNDI, network port, security…). There are endless variations of code, runtime stacks, runtime configurations, and combinations. These combinations can usually be assigned to one of the following families: Production, Testing, and Development.
Running a Docker container takes just a single, simple shell command. One of the greatest benefits of Docker is that it enables us to clearly define and enforce these boundaries and their desired composition in a predictable, repeatable, managed fashion.
Exactly how to achieve this is up to the user, as there are no solid ‘best practices’ yet. Docker has only recently hatched, and practices will materialise as Docker matures and gains more adoption.
So what’s the problem?
After all this talk and slight glorification of Docker, surprisingly – I will be using Docker to run Cassandra. Out of the tens of Cassandra Docker images out there, I chose the one maintained by pokle.
So let’s summarise everything up until this point: blah blah blah, I decided to use Docker, and I found a Cassandra Docker image I like. Great.
But Docker is a ‘containerizer’: its whole goal is to create a contained environment for a process (in this case, Cassandra). How can we get our two containers to communicate with each other?
If these were two Docker containers running in the same host, the answer would be linking them. But although that may be somewhat useful for development, it is of little value for production and even testing. The whole goal of a cluster is to increase availability and performance by running Cassandra (in this case) on multiple physical hosts at the same time.
To do that within a single network domain, one could just expose the Docker containers’ ports to the host (using Docker’s ‘expose’ or -p options). But my scenario was quite a bit more complex than that: not only does the cluster span multiple hosts, it also spans multiple cloud providers. This means that hosts on cloud provider 1 don’t communicate directly with hosts on cloud provider 2. Obviously we need to do some networking configuration on each cloud to allow this communication (obtaining a public IP, securing it, etc.), but still: how do we connect Cassandra 1 on Azure to Cassandra 2 on AWS to Cassandra 3 on IBM BlueMix?
So what’s ‘Weave’ ?
The same way that Docker helps us separate the concerns of code, runtime stack and configuration, Weave adds a layer of indirection between the runtime stack and the network configuration.
Using Weave, you can define your actual deployment stack in terms of a virtual network, or ‘imaginary IP numbers’ if you want: Weave creates a Software Network (or an Overlay Network), which is a fancy name for a virtual Ethernet switch facade for your Docker containers. Just imagine each Docker container as a small box that you connect via a cable to this switch, at will. You can also connect external network services to this switch, expose Docker services to the outside world and more – but that’s not needed for what we’re going to do. We’re just going to use the most basic feature of Weave: creating the overlay network.
For Weave to work, Weave hosts need to be able to communicate with their peers. It is up to the reader to make sure that the different Weave peers can communicate over WAN (e.g. have a publicly accessible IP, ideally with security taken into account). Hence, demonstrating in practice the value of separation of concerns, at least for me 🙂
This also means that Weave hosts need to know about their peers. This is addressed on the command line: the WAN addresses of the other Weave hosts are passed as arguments to weave launch, and Cassandra learns about its own peers through Docker environment variables such as SEEDS.
Using Weave and Docker to create a distributed Cassandra cluster across AWS, Azure and my Laptop
Acta, non verba:
- Install Docker on your laptop / workstation
  In my case, I have a Mac running OS-X, so I use boot2docker
- Get an AWS (or Azure, or …) instance with a public IP and install Docker (I used CoreOS)
- Download and install Weave on both
- Repeat
Now, let’s create and use a 10.2.0.0/16 virtual network, and set up a Cassandra cluster between our nodes, with addresses in this CIDR range.
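Since every address handed to Weave has to fall inside the chosen range, a quick sanity check can catch typos before anything is launched. The following is a minimal sketch in POSIX shell; `ip_to_int` and `in_cidr` are hypothetical helper names, not part of Weave or Docker, and the script only does arithmetic on the addresses.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Weave): check that an IPv4 address
# belongs to a given network, e.g. 10.2.0.0/16.

# Convert a dotted-quad address to a single integer.
ip_to_int() {
  old_ifs="$IFS"; IFS=.
  set -- $1            # split the address into four octets
  IFS="$old_ifs"
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# in_cidr <address> <network> <prefix length>
# Succeeds (exit status 0) when the address lies inside the network.
in_cidr() {
  addr=$(ip_to_int "$1")
  net=$(ip_to_int "$2")
  mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
  [ $(( addr & mask )) -eq $(( net & mask )) ]
}

# Check the addresses used in this post against 10.2.0.0/16.
for ip in 10.2.0.1 10.2.0.2 10.2.1.3 10.2.1.4; do
  if in_cidr "$ip" 10.2.0.0 16; then
    echo "$ip is inside 10.2.0.0/16"
  else
    echo "$ip is OUTSIDE 10.2.0.0/16"
  fi
done
```

All four addresses used in this post pass the check; an address such as 10.3.0.1 would not.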
For each remote server (Azure, AWS):
# Set up the virtual Ethernet switch, and assign a free IP and address space
sudo ./weave launch 10.2.0.1/16 <WAN IP addresses of other servers running Weave>
# Run a Cassandra instance; set SEEDS to the virtual(!) IPs of itself (important!) and the other Cassandra nodes
sudo ./weave run 10.2.1.3/24 -d --name cass2 -e SNITCH=GossipingPropertyFileSnitch -e DC=AWS -e RACK=RACK1 -e SEEDS=10.2.1.3,10.2.1.4 -e LISTEN_ADDRESS=10.2.1.3 poklet/cassandra
# You can repeat this for every remote server, but note that the addresses passed to launch and run must be unique, and that SEEDS=... needs to point to the other Cassandra peers.
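When repeating the run step for several servers, it is easy to mistype an address in one of the many -e options. The following is a minimal sketch of a helper that only prints the command for a node, so it can be reviewed before being run by hand. `weave_run_cmd` is a hypothetical name, not part of Weave; the options it prints are the ones used in this post.

```shell
#!/bin/sh
# Hypothetical helper (not part of Weave): prints, but does not execute,
# the `weave run` command for one Cassandra node.
# Usage: weave_run_cmd <name> <virtual IP> <DC> <rack> <comma-separated seed IPs>
weave_run_cmd() {
  printf 'sudo ./weave run %s/24 -d --name %s -e SNITCH=GossipingPropertyFileSnitch -e DC=%s -e RACK=%s -e SEEDS=%s -e LISTEN_ADDRESS=%s poklet/cassandra\n' \
    "$2" "$1" "$3" "$4" "$5" "$2"
}

# The two nodes from this post: each gets its own name and virtual IP.
weave_run_cmd cass2 10.2.1.3 AWS RACK1 10.2.1.3,10.2.1.4
weave_run_cmd cass1 10.2.1.4 HOME LAPTOP 10.2.1.3
```

Because the virtual IP is passed once and reused for both the Weave address and LISTEN_ADDRESS, the two can never drift apart.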
On your laptop (optional, but really cool)
If you have boot2docker on your laptop (otherwise, use the same procedure as a ‘remote server’):
# Set up the virtual Ethernet switch, and assign a free IP (note: 10.2.0.1 is already taken) and address space
boot2docker ssh "sudo ./weave launch 10.2.0.2/16 <WAN IP of your remote server 1>,<WAN IP of your remote server 2>.."
boot2docker ssh "sudo ./weave run 10.2.1.4/24 -d --name cass1 -e SNITCH=GossipingPropertyFileSnitch -e DC=HOME -e RACK=LAPTOP -e SEEDS=10.2.1.3 -e LISTEN_ADDRESS=10.2.1.4 poklet/cassandra"
Your cluster should be up and running.
Don’t believe me? check!
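One quick way to check is Cassandra’s own `nodetool status`, which lists every node it knows about and marks healthy ones as `UN` (Up/Normal). The following is a minimal sketch that counts those lines; `count_up_nodes` is a hypothetical helper name, and the commented-out usage line assumes that `nodetool` is on the PATH inside the poklet/cassandra container and that your Docker version supports `docker exec`.

```shell
#!/bin/sh
# Hypothetical helper: reads `nodetool status` output on stdin and prints
# the number of nodes whose state column is "UN" (Up/Normal).
count_up_nodes() {
  awk '$1 == "UN" { n++ } END { print n + 0 }'
}

# Example usage (assumption: nodetool is available inside the container):
#   docker exec cass1 nodetool status | count_up_nodes
```

If the printed number matches the number of Cassandra containers you started, the nodes have found each other across the overlay network.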
Using / Testing the cluster
Opening CQL shell:
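To open a CQL shell on any of the nodes (replace cass1 with the name of the Cassandra container on that host, e.g. cass2 on the remote server):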
docker run -it --rm --link cass1:cass poklet/cassandra cqlsh cass
To test our setup, let’s create some data on one of the nodes, and then select it on any other node in the cluster:
Loading data into the cluster
Copy the following lines, and paste them in CQL shell:
CREATE KEYSPACE test_keyspace WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 2};
USE test_keyspace;
CREATE TABLE test_table ( id text, test_value text, PRIMARY KEY (id) );
INSERT INTO test_table (id, test_value) VALUES ('1', 'one');
INSERT INTO test_table (id, test_value) VALUES ('2', 'two');
INSERT INTO test_table (id, test_value) VALUES ('3', 'three');
SELECT * FROM test_table;
* Note: ‘replication_factor’: 2 means that each write will replicate to two nodes. There are smarter things to do in a real Cassandra cluster (which I would love to see in comments to this post), but this is good enough for this demonstration’s purpose.
Now, open CQL shell on another instance (that’s the whole point), and do:
USE test_keyspace;
SELECT * FROM test_table;
You should see:
 id | test_value
----+------------
  3 |      three
  2 |        two
  1 |        one

(3 rows)
What we just did
We just created a Cassandra cluster across multiple cloud providers and a laptop, using Docker as a container for Cassandra and Weave as an overlay network. This demonstrates the value of Docker and Weave: each adds a necessary level of indirection / abstraction, resulting in clearly defined boundaries, also known as ‘separation of concerns’.
Hopefully this post will help others, not just on the practical plane of ‘getting a Cassandra cluster’ but also, and more importantly, by starting a discussion around the future of IT in the wake of Docker.
The future
But wait, there’s more:
The Weave team back in London are not resting, and neither are Docker. Weavers are working on Service Discovery, integration with Docker’s new libraries, and integration with other projects such as Consul, CoreOS, Kubernetes and Flocker.
As for myself, I think it’s time to have a fresh look at utility computing. I’ve just taken my first baby steps in a long journey towards a micro-services, multi-cloud architecture that I am working on, one that will hopefully excite a lot of people and make a big impact on the way utility computing is done, and the economics behind it. This blog post is my first ever, and I hope that many more will follow. You can find my notes on this journey on my own website, www.multicloud.me, and on my blog, soon to be published.
–Yaron