Using Weave Network for Big Data: ElasticSearch & Apache Spark
As part of our continuing effort to demonstrate how weave overlay networking makes deploying complex distributed software stacks simpler, I decided to pick a couple of very hot open-source projects and show something people commonly do: namely, analysing Twitter streams with the help of ElasticSearch and Apache Spark. The former is a highly scalable and blazingly fast search engine; the latter is the latest big data analysis platform.
I will first briefly introduce some of the key parts of the demo, and then provide a step-by-step guide by means of a screencast. We intend to follow up with a more technically detailed blog post very shortly, so please watch out for it, and send any comments or suggestions regarding the contents of this screencast to firstname.lastname@example.org.
Firstly, I thought I should provide some references on the use of ElasticSearch and Spark for Twitter analytics. A partnership between Databricks (core maintainers of Spark) and ElasticSearch was announced earlier this year, which was followed by Holden Karau's (@holdenkarau) demonstration at the Spark Summit, with the demo code available. Additionally, I found a more recent talk and code by Alexis Seigneurin (@aseigneurin), which also uses both of these pieces.
Unlike the aforementioned authors, I decided to take a simpler path to populating ElasticSearch with tweets, picking the off-the-shelf Twitter River plugin. My Spark code, again, is a lot simpler: it's really just a "Hello, World" for Spark, but it works and is easy to reproduce. I hope you will find a much more interesting computation to perform, given that you now have some great tooling to help you bootstrap a cluster in the cloud.
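To give a flavour of what such a "Hello, World" computation does, here is a minimal sketch in plain Python of the same idea: counting hashtags across a batch of tweet texts. This is essentially the flatMap/reduceByKey pipeline a trivial Spark job would run over tweets pulled out of ElasticSearch; the sample tweets below are made up for illustration.

```python
from collections import Counter
import re

def count_hashtags(tweets):
    """Count hashtag occurrences across a batch of tweet texts.

    Mirrors the trivial map-reduce a "Hello, World" Spark job
    would perform over tweets indexed by the river plugin.
    """
    counts = Counter()
    for text in tweets:
        # Extract hashtags, normalising case so #Spark and #spark match.
        for tag in re.findall(r"#\w+", text.lower()):
            counts[tag] += 1
    return counts

# Hypothetical sample data, standing in for indexed tweets.
sample = [
    "Deploying #spark on #docker with weave",
    "#Spark streaming is fun",
    "Searching tweets with #elasticsearch and #spark",
]

print(count_hashtags(sample).most_common(3))
```

In a real Spark job the loop body becomes a `flatMap` over tweet texts followed by a `reduceByKey`, with ElasticSearch as the input source.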
One reason you may wish to use weave in a cross-cloud setting is when each cloud vendor offers specific features that you want to leverage for different parts of your application. These may be as low-level as the GPU-enabled instances vendor A provides for your Spark cluster, or as high-level as the pricing of cross-region data transfer that vendor B offers. However, this is not the only reason; more on that in upcoming posts.
One very important aspect that weave simplifies for ElasticSearch is how a containerised cluster is formed. On a single local network the built-in discovery mechanisms work, but it takes some effort to make containers on different hosts talk to each other, and making this work across data centres requires further work still. With weave, all of this becomes very simple. Adding weaveDNS to the mix provides some extra convenience: firstly, configuration files and logs become easier for humans to read; secondly, the IP addresses assigned to each of the nodes become less relevant (watch out for upcoming feature announcements regarding this).
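As an illustration of that convenience, with weaveDNS giving each container a name, an ElasticSearch unicast discovery configuration can refer to peers by hostname rather than by IP. The hostnames below are hypothetical, and the settings shown are a sketch in the style of ElasticSearch 1.x zen discovery:

```yaml
# elasticsearch.yml (sketch; hostnames are made up for this example)
cluster.name: weave-demo
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts:
  - es-1.weave.local
  - es-2.weave.local
  - es-3.weave.local
```

Because the names stay stable even if containers are rescheduled and get new addresses, the configuration never needs to change when IPs do.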
The video below demonstrates how weave can help you deploy a very complex distributed software stack across two different cloud providers, each in a different geo-location. I showcase how weave makes it very simple to bring up a cross-cloud infrastructure running Apache Spark and ElasticSearch on CoreOS and Docker.
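The cross-cloud bring-up in the screencast follows weave's usual pattern. Roughly, and with made-up addresses and image names, it looks like this (using weave's CLI at the time of writing):

```
host1$ weave launch
host2$ weave launch $HOST1_PUBLIC_IP   # peer with the first host, across clouds
host1$ weave run 10.2.1.1/24 -d --name es-1 elasticsearch-image
host2$ weave run 10.2.1.2/24 -d --name spark-master spark-image
```

Once launched, each container can reach the others over the weave network as if they were on one flat subnet, regardless of which cloud they run in.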
I hope you will find this interesting and use it as a basis for an amazing project. Do make sure to follow @weavenetwork, and keep an eye on this blog for further posts. I would like to take the opportunity to thank Holden and Alexis for making their code available; those examples helped me a lot in getting started. I would also like to thank David Pollak (@dpp) for helping me hack on this.
We are very excited here at Weaveworks about today's news; there are great times ahead! Most importantly, we are well set to build more great tools for you.