Juju Big Data community
When exploring big data solutions, one of the most daunting tasks users face is the setup and configuration of these usually complex environments. This can take anywhere from hours to days, time you could spend testing, evaluating, and putting your big data solutions to good use.
The mission of the Juju Big Data team is to offer a simple and repeatable method for deploying big data environments. We’ve created a “pluggable” model using Juju charms and bundles to let users focus on the fun part (actually solving big data problems) without worrying about the intricacies of configuring core Hadoop services.
This post provides an overview of the foundation we’ve built for interfacing with Hadoop, as well as extensions to that foundation that make for totally awesome demos. You’ll also find a few videos related to our work and of course, how to contact us!
We have three main focus areas for development. First, we have a library to help with tasks that are commonly needed when doing big data development. Next are the Juju charms that model individual big data services. Finally, we have pre-canned solutions in the form of bundles that tie charms together for simple and repeatable deployment.
Where useful, we’ve included a DEV-README.md in our repositories. That document is meant to guide users interested in extending our generic offerings for more specific uses. We’ll point out the repository locations and these development documents as we get into the details of our focus areas below.
Juju big data library
The jujubigdata Python library is a collection of functions and classes for simplifying the development of big data applications. This includes things like synchronizing /etc/hosts across a cluster, configuring core-site.xml, and interacting with Hadoop.
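To make those tasks concrete, here is a minimal sketch of the kind of cluster plumbing such a library automates; the helper names below are ours for illustration, not the jujubigdata API itself.

```python
# Illustrative helpers only; these names are NOT the jujubigdata API.
import xml.etree.ElementTree as ET
from pathlib import Path

def update_etc_hosts(hosts_path, entries):
    """Merge {hostname: ip} entries into an /etc/hosts-style file,
    dropping any stale lines that mention the same hostnames."""
    lines = Path(hosts_path).read_text().splitlines()
    managed = set(entries)
    # Keep only lines whose hostnames we are not managing.
    kept = [line for line in lines
            if not set(line.split()[1:]) & managed]
    kept += ["{} {}".format(ip, host) for host, ip in sorted(entries.items())]
    Path(hosts_path).write_text("\n".join(kept) + "\n")

def set_hadoop_property(xml_path, name, value):
    """Set (or add) a <property> in a Hadoop-style configuration file
    such as core-site.xml."""
    tree = ET.parse(xml_path)
    root = tree.getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            prop.find("value").text = value
            break
    else:  # no existing property with that name; append one
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    tree.write(xml_path)
```

Doing this once per hook invocation, idempotently, is exactly the sort of boilerplate the library keeps out of individual charms.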
Juju big data charms
Charms are at the heart of our development. Each charm models a particular big data service (e.g., the NameNode, Apache Flume, or Hue). Most of our charm development is in Python and makes use of the Juju big data library mentioned above.
While helpful as a method to install a service, charms really shine when you relate them to other services. We’ll talk more about that when we discuss bundles below, but first, here are our current charms:
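As a rough sketch of what relating services looks like, using charm names that appear below (the exact endpoints and CLI syntax depend on your Juju version, so treat this as illustrative rather than a verbatim recipe):

```shell
# Deploy an end-user charm and attach it to the cluster via the
# apache-hadoop-plugin subordinate charm (illustrative commands).
juju deploy apache-pig
juju deploy apache-hadoop-plugin plugin
juju add-relation apache-pig plugin
# The plugin is then related to the HDFS and YARN masters so that
# apache-pig can reach the cluster.
```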
Designed to talk to other Flume agents, this charm will allow you to ingest data into HDFS in AVRO format.
Syslog comes in, AVRO events go out. Connect to apache-flume-hdfs to shove those events into HDFS.
Tweets come in, AVRO events go out. Connect to apache-flume-hdfs to shove those tweets into HDFS (requires Twitter API credentials to access the firehose).
Provides an endpoint that plugs into the Hadoop cluster using the apache-hadoop-plugin subordinate charm. It allows users to manually run MapReduce jobs (e.g., teragen or terasort).
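Running the stock example jobs from the client unit might look like the following; the jar path varies by Hadoop release, so the wildcard is illustrative:

```shell
# SSH into the client unit that is plugged into the cluster.
juju ssh apache-hadoop-client/0

# Generate 1,000,000 rows of sample data, then sort them.
hadoop jar /usr/lib/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    teragen 1000000 /user/ubuntu/teragen
hadoop jar /usr/lib/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    terasort /user/ubuntu/teragen /user/ubuntu/terasort
```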
Connects to the HDFS and YARN masters to handle dfs and mapreduce tasks. See the DEV-README for details about this charm’s interfaces.
Provides the HDFS Master. See the DEV-README for details about this charm’s interfaces.
Provides the Secondary Namenode. See the DEV-README for details about this charm’s interfaces.
A subordinate charm that facilitates communication with the Hadoop cluster. This is designed to be deployed alongside our apache-hadoop-client charm as well as our end-user service charms (e.g., apache-pig). See the DEV-README for details about this charm’s interfaces.
Provides the YARN Master. See the DEV-README for details about this charm’s interfaces.
Provides SQL-like analytics with Hive. This is designed to interact with the Hadoop cluster via our apache-hadoop-plugin subordinate charm.
Provides analytic capabilities using Pig. This is designed to interact with the Hadoop cluster via our apache-hadoop-plugin subordinate charm.
Provides the Spark execution engine, designed to run against a Hadoop cluster.
An IPython Notebook service with integrated Spark support.
The Zeppelin notebook service for interacting with a Spark+Hadoop cluster.
Juju big data bundles
Bundles are groups of charms that model a complete solution. We’ve come up with a few that we think the big data community will find useful right away:
This bundle deploys a complete Hadoop cluster with 7 units: an HDFS Master, a YARN Master, a Secondary Namenode, three Compute Slaves, and a Client unit that is plugged into the cluster and ready to run big data jobs. If you want a Hadoop environment deployed, configured, and ready to do work, this is the bundle for you.
We believe this is a good starting point for users who want to build on top of a known-good Hadoop deployment. To that end, this bundle has a DEV-README that describes how you might interact with and extend it to meet your needs. You’ll notice that all of our remaining bundles build off this one to showcase more specific solutions.
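For a feel of what a bundle file expresses, here is a rough, illustrative sketch of the seven-unit layout described above; the real bundle’s service names, charm URLs, and relations may differ, so consult the bundle itself rather than copying this:

```yaml
# Illustrative bundle sketch only; names and relations are not verbatim.
services:
  hdfs-master:
    charm: apache-hadoop-hdfs-master   # name illustrative
    num_units: 1
  yarn-master:
    charm: apache-hadoop-yarn-master   # name illustrative
    num_units: 1
  secondary-namenode:
    charm: apache-hadoop-hdfs-secondary  # name illustrative
    num_units: 1
  compute-slave:
    charm: apache-hadoop-compute-slave   # name illustrative
    num_units: 3
  client:
    charm: apache-hadoop-client
    num_units: 1
  plugin:
    charm: apache-hadoop-plugin          # subordinate, no units
relations:
  - [compute-slave, hdfs-master]
  - [compute-slave, yarn-master]
  - [plugin, hdfs-master]
  - [plugin, yarn-master]
  - [client, plugin]
```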
apache-analytics-sql [dev bundle]
Extending the core Hadoop bundle to provide analytic capabilities with Apache Hive and MySQL.
apache-analytics-pig [dev bundle]
Extending the core Hadoop bundle to provide analytic capabilities with Apache Pig.
apache-ingestion-flume [dev bundle]
Extending the core Hadoop bundle with ingestion capabilities using Apache Flume.
apache-hadoop-spark [dev bundle]
Extending the core Hadoop bundle to provide the Apache Spark execution engine.
apache-hadoop-spark-notebook [dev bundle]
Further extending the Spark bundle with IPython Notebook.
apache-hadoop-spark-zeppelin [dev bundle]
Further extending the Spark bundle with Apache Zeppelin.
realtime-syslog-analytics [dev bundle]
Combining ingestion (Flume), processing (Spark), and visualization (Zeppelin) into a log analytics solution.
We have a few presentations, videos, and other bits of media that may be helpful if you want to see what we’re all about.
- Ubuntu Online Summit, 04/2015
- Hadoop on OpenPower (part 1)
- Hadoop on OpenPower (part 2)
- Hadoop+Spark on OpenPower (part 3)
- Hadoop+Spark on OpenPower (part 4)
You can find us on irc.freenode.net, or feel free to email/join our mailing list. We look forward to hearing from you!