Creating Big Data solutions with Juju

Juju Big Data community

When exploring big data solutions, one of the most daunting tasks users face is the setup and configuration of these usually complex environments. This can take anywhere from hours to days: time you could spend testing, evaluating, and putting your big data solutions to good use.

The mission of the Juju Big Data team is to offer a simple and repeatable method for deploying big data environments. We’ve created a “pluggable” model using Juju charms and bundles to let users focus on the fun part (actually solving big data problems) without worrying about the intricacies of configuring core Hadoop services.

This post provides an overview of the foundation we’ve built for interfacing with Hadoop, as well as extensions to that foundation that make for totally awesome demos. You’ll also find a few videos related to our work and of course, how to contact us!

Code

We have three main focus areas for development. First, we have a library to help with tasks that are commonly needed when doing big data development. Next are the Juju charms that model individual big data services. Finally, we have pre-canned solutions in the form of bundles that tie charms together for simple and repeatable deployment.

Where useful, we’ve included a DEV-README.md in our repositories. That document is meant to guide users interested in extending our generic offerings for more specific uses. We’ll point out the repository locations and these development documents as we get into the details of our focus areas below.

Juju big data library

The jujubigdata Python library is a collection of functions and classes for simplifying the development of big data applications. This includes things like synchronizing /etc/hosts across a cluster, configuring core-site.xml, interacting with Hadoop, etc.
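
For a sense of the API, here’s a minimal sketch that uses the library’s XML property-map helper to edit core-site.xml in place. The file path and property value are illustrative, so adjust them for your deployment.

    # A minimal sketch (Python), assuming jujubigdata is installed.
    from jujubigdata import utils

    # Opens the file, exposes its <property> entries as a dict, and
    # writes any changes back when the block exits.
    with utils.xmlpropmap_edit_in_place('/etc/hadoop/conf/core-site.xml') as props:
        props['fs.defaultFS'] = 'hdfs://hdfs-master:8020'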

Charms

Charms are at the heart of our development. Each charm models a particular big data service (e.g.: the NameNode, Apache Flume, Hue, etc). Most of our charm development is in Python and makes use of the Juju big data library mentioned above.

While helpful as a method to install a service, charms really shine when you relate them to other services. We’ll talk more about that when we discuss bundles below (and there’s a small deployment sketch right after this list), but first, here are our current charms:

  • apache-flume-hdfs [dev repo | dev charm]

    Designed to talk to other Flume agents, this charm lets you ingest data into HDFS in Avro format.

  • apache-flume-syslog [dev repo | dev charm]

    Syslog comes in, Avro events go out. Connect it to apache-flume-hdfs to shove those events into HDFS.

  • apache-flume-twitter [dev repo | dev charm]

    Tweets come in, Avro events go out. Connect it to apache-flume-hdfs to shove those tweets into HDFS (requires Twitter API credentials to access the firehose).

  • apache-hadoop-client [dev repo | dev charm | stable repo | stable charm]

    Provides an endpoint that plugs into the Hadoop cluster using the apache-hadoop-plugin subordinate charm. It allows users to manually run MapReduce jobs (e.g.: teragen, terasort, etc).

  • apache-hadoop-compute-slave [dev repo | dev charm | stable repo | stable charm]

    Connects to the HDFS and YARN masters to handle DFS and MapReduce tasks. See the DEV-README for details about this charm’s interfaces.

  • apache-hadoop-hdfs-master [dev repo | dev charm | stable repo | stable charm]

    Provides the HDFS Master. See the DEV-README for details about this charm’s interfaces.

  • apache-hadoop-hdfs-secondary [dev repo | dev charm | stable repo | stable charm]

    Provides the Secondary Namenode. See the DEV-README for details about this charm’s interfaces.

  • apache-hadoop-plugin [dev repo | dev charm | stable repo | stable charm]

    A subordinate charm that facilitates communication with the Hadoop cluster. This is designed to be deployed alongside our apache-hadoop-client charm as well as our end-user service charms (e.g.: apache-hive, apache-pig, etc). See the DEV-README for details about this charm’s interfaces.

  • apache-hadoop-yarn-master [dev repo | dev charm | stable repo | stable charm]

    Provides the YARN Master. See the DEV-README for details about this charm’s interfaces.

  • apache-hive [dev repo | dev charm | stable repo | stable charm]

    Provides sql-like analytics with Hive. This is designed to interact with the Hadoop cluster via our apache-hadoop-plugin charm.

  • apache-pig [dev repo | dev charm | stable repo | stable charm]

    Provides analytic capabilities using Pig. This is designed to interact with the Hadoop cluster via our apache-hadoop-plugin charm.

  • apache-spark [dev repo | dev charm | stable repo | stable charm]

    Provides the Spark execution engine, designed to run against a Hadoop cluster.

  • apache-spark-notebook [dev repo | dev charm]

    An IPython Notebook service with integrated Spark support.

  • apache-zeppelin [dev repo | dev charm | stable repo | stable charm]

    The Zeppelin notebook service for interacting with a Spark+Hadoop cluster.
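
As promised, here’s a hedged sketch of how charms snap together: deploying a client and plugging it into a running cluster (juju 1.x syntax; the master service names below are whatever you chose at deploy time, and the exact relation endpoints are documented in each charm’s DEV-README).

    # Deploy the client and the subordinate plugin (names illustrative).
    juju deploy apache-hadoop-client client
    juju deploy apache-hadoop-plugin plugin

    # The plugin rides along with the client and brokers communication
    # with the HDFS and YARN masters.
    juju add-relation plugin client
    juju add-relation plugin hdfs-master
    juju add-relation plugin yarn-master

    # Once the cluster settles, run a sample job from the client unit
    # (the examples jar path varies with your Hadoop packaging).
    juju ssh client/0 'hadoop jar /usr/lib/hadoop/hadoop-mapreduce-examples.jar teragen 10000 tgen'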

Bundles

Bundles are groups of charms that model a solution. We’ve come up with a few that we think the big data community will find useful right away.

Our core Hadoop bundle deploys a complete Hadoop cluster with 7 units: an HDFS Master, a YARN Master, a Secondary Namenode, three Compute Slaves, and a Client unit that is plugged into the cluster and ready to run big data jobs. For anyone who wants a Hadoop environment deployed, configured, and ready to do work, this is the bundle for you.

We believe this is a good place to start for users who want to build on top of a known-good Hadoop deployment. To that end, we have a DEV-README for this bundle that describes how you might interact with and extend it to meet your needs. You’ll notice all of our remaining bundles build off this one to showcase more specific solutions (a quickstart sketch follows the list below):

  • apache-analytics-sql [dev bundle]

    Extending the core Hadoop bundle to provide analytic capabilities with Apache Hive and MySQL.

  • apache-analytics-pig [dev bundle]

    Extending the core Hadoop bundle to provide analytic capabilities with Apache Pig.

  • apache-ingestion-flume [dev bundle]

    Extending the core Hadoop bundle with ingestion capabilities using Apache Flume.

  • apache-hadoop-spark [dev bundle]

    Extending the core Hadoop bundle to provide the Apache Spark execution engine.

    • apache-hadoop-spark-notebook [dev bundle]

      Further extending the Spark bundle with IPython Notebook.

    • apache-hadoop-spark-zeppelin [dev bundle]

      Further extending the Spark bundle with Apache Zeppelin.

    • realtime-syslog-analytics [dev bundle]

      Combining ingestion (Flume), processing (Spark), and visualization (Zeppelin) into a log analytics solution.
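
As for getting one of these bundles on the ground, here’s a sketch using juju-quickstart. The bundle reference below is illustrative; use the dev bundle link above for the bundle you actually want.

    # Install juju-quickstart (on Ubuntu), then bootstrap an environment
    # and deploy a bundle in one shot. The bundle name here is
    # illustrative; grab the exact reference from the links above.
    sudo apt-get install juju-quickstart
    juju quickstart apache-analytics-sql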

Media

We have a few presentations, videos, and other bits of media that may be helpful if you want to see what we’re all about.

Contact

You can find us in #juju on irc.freenode.net, or feel free to email us or join our mailing list. We look forward to hearing from you!