It’s been a big year for Apache Hadoop, the open source project that helps you split your workload across a rack of computers. The buzzword is now well known to your boss but still just a vague and hazy concept for your boss’s boss. That puts it in the sweet spot where there’s plenty of room for experimentation. The list of companies using Hadoop in production work grows longer each day, and it probably won’t be long before “Hadoop cluster” takes over the role that the words “crazy supercomputer” used to play in thriller movies. The next version of the WOPR is bound to run Hadoop.
The area is flourishing as the core project attracts a wide collection of helper projects that organize the workload and make it simpler to manage a collection of jobs to run at particular times. There’s HDFS, the distributed file system that organizes the data spread across the cluster; Hive, a data warehousing layer for making sense of this data; Mahout, a collection of machine learning routines for trying to learn something from said data; and ZooKeeper, a coordination service for keeping all of the balls in the air. At least a half-dozen other open source tools live in a stable orbit around Hadoop.
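For anyone who hasn’t seen one of these jobs up close, the work being split across the rack usually boils down to a pair of small programs, a mapper and a reducer. What follows is a minimal word-count pair written for Hadoop Streaming, which hands each program lines on standard input and collects tab-separated key/value pairs from standard output. It’s a generic sketch, not code pulled from any of the distributions reviewed here.

    #!/usr/bin/env python
    # mapper.py -- Hadoop Streaming mapper: emit "word<TAB>1" for every word it reads.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- Hadoop Streaming reducer: sum the counts for each word.
    # Hadoop sorts the mapper output, so all the lines for a given word arrive together.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

The framework takes care of the rest: chopping the input into pieces, shipping the pieces to the nodes, sorting the mapper output, and gluing the results back together.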
The open source projects are just the beginning — a surprisingly large number of companies are emerging with the plan of helping people actually use Hadoop. Some are just selling support, and others are building their own tools that sit alongside Hadoop and make it easier to use.
This kind of competition is usually seen as open source at its best. There’s a core collection of packages that serves as a common standard, keeping everyone in sync. Each of the groups is competing to add the special sauce that will attract customers, both paying and nonpaying. There continues to be controversy over just how much gets rolled into the central collection, as there can be in any major open source project, but there’s so much experimentation going on that it’s hard to get too worked up over exactly how much is shared.
To get a feel for the excitement, I took four major collections out for a test-drive. I powered up a cluster of nodes on Rackspace, installed the tools, pushed the buttons, and ran some sample jobs. It’s getting to be surprisingly easy to spend a few pennies for an hour or two of machine time — so much so that I found myself debating whether it was worth leaving my cluster idling over lunchtime. Lest anyone doubt the efficiency of cloud computing, I noticed that the rate for my cluster of relatively fat machines with 4GB of RAM was less than the cost to park a car around the corner. The parking meters spin faster.
The not-so-good news is that these collections are far from perfect. None of the tools I tried worked exactly as promised. There were always small glitches. I often found myself reading the log files and paging through endless lists of Java stack dumps. (Someone is going to have to apply Hadoop to analyzing the endless stack dumps. They’re getting so involved that I doubt a single machine will be able to parse them for much longer.) After a few seconds, I could usually get things back on track. Once these tools are running, they don’t demand much experience to use, but installing them is another matter: you need to be fairly adept with the way a Java stack is put together.
Despite these impediments, I spent most of my time churning through data. The good news is that all of these tools make it pretty easy to get a cluster of computers working together to solve problems. Using these tools is much easier than downloading and configuring the source code yourself. They’re designed to be one-button applications, and they come close to achieving that goal.
Amazon Elastic MapReduce
It should be no surprise that Amazon, one of the pioneers of cloud computing, offers a mechanism for spinning up Hadoop clusters on its EC2 cloud. Elastic MapReduce is tightly integrated with all of Amazon’s other elastic offerings, and it sits as another tab on the Amazon Web Services main page. You store your data in S3, then fire up a job to churn through it.
The integration is nicely done. Amazon provides a Java-based Web interface that does a great job of hand-holding, taking care of many of the glitches that often occur when you’re first trying software. When it wanted to store data in an S3 bucket, it flipped me over to a page for creating the bucket.
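If you’d rather script that step than click through it, the bucket creation and upload can be handled in a few lines of code. Here’s a rough sketch using boto, the Python library for Amazon Web Services from that era; the bucket name and file path are placeholders, and the exact call signatures may differ between library versions.

    #!/usr/bin/env python
    # Create an S3 bucket and upload an input file for a later Elastic MapReduce run.
    # Assumes AWS credentials are available in the environment or a boto config file.
    import boto
    from boto.s3.key import Key

    conn = boto.connect_s3()
    bucket = conn.create_bucket('my-emr-demo-bucket')   # placeholder; bucket names must be globally unique

    k = Key(bucket)
    k.key = 'input/sample.txt'                          # where the job will look for its data
    k.set_contents_from_filename('sample.txt')          # local file to push up to S3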
If the Web GUI is a bit too babyish for you, there’s also a classic Web service API that a number of other programmers have wrapped up in their own software. I played a bit with a Ruby-based collection of tools that submits jobs and starts them running. Jobs typically begin and end with data sitting in S3.
With Elastic MapReduce, Amazon is essentially offering nicer packaging on top of EC2 for those who are willing to plunge deeper into Amazon Web Services. I could have built my own cluster of machines on EC2 and used any of the Hadoop distros to spin them up, but Elastic MapReduce offers a nice set of shortcuts. Amazon has already built and integrated the infrastructure, and you just push the buttons to choose which version of Hadoop (0.18 or 0.20) you want to use. There’s no need to worry about which version of Linux is running underneath.
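Those same choices are exposed through the Web service API, so the whole thing can be scripted. Here’s a rough sketch of submitting a streaming job through boto’s EMR module, a Python alternative to the Ruby tools mentioned above; the S3 paths are placeholders, and the keyword arguments are my assumption about the boto releases of that period rather than anything documented here.

    #!/usr/bin/env python
    # Spin up an Elastic MapReduce job flow that runs one Hadoop Streaming step.
    # The mapper/reducer scripts and bucket paths are placeholders.
    import boto.emr
    from boto.emr.step import StreamingStep

    conn = boto.emr.connect_to_region('us-east-1')

    step = StreamingStep(
        name='word count',
        mapper='s3://my-emr-demo-bucket/scripts/mapper.py',
        reducer='s3://my-emr-demo-bucket/scripts/reducer.py',
        input='s3://my-emr-demo-bucket/input/',
        output='s3://my-emr-demo-bucket/output/')

    jobflow_id = conn.run_jobflow(
        name='demo cluster',
        log_uri='s3://my-emr-demo-bucket/logs/',
        hadoop_version='0.20',            # the 0.18-or-0.20 choice from the console
        master_instance_type='m1.small',
        slave_instance_type='m1.small',
        num_instances=3,
        steps=[step])

    print(jobflow_id)                     # poll this ID to watch the job flow's progress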
The infrastructure is quite nice. You can pay the standard on-demand price for your machines or bid for spare capacity on the spot market. This is the kind of extra feature that thrills the free-market fans, but I found it confusing. You choose your bid and take your chances. If you bid too little, you could end up waiting a long time, perhaps even forever.
It should be noted that the cloud doesn’t respond instantaneously. It took from 5 to 18 minutes to execute tiny jobs that would finish in seconds on a fully configured cluster in your own server closet. The overhead wouldn’t make a difference for a big job, but it’s not the same as having your own cluster waiting patiently for you to push the Start button.
Taking advantage of all of these features means buying into Amazon’s storage system. If you’re already using S3 for your data, you’ll be ready to go. If you’re not, you’ll have to make some decisions. Some people find that S3 is too expensive for bulk data that’s rarely accessed. You’re paying for all of the engineering that’s been built for people who need a fairly good response time, and that price is built into the retrieval costs.
I think Amazon’s offering is a good option for two classes of users. If you already keep most of the relevant data in Amazon’s cloud, Elastic MapReduce makes it easy to spin up jobs to analyze it. The piping is already in place.
The other group would be those who don’t need a cluster most of the time but want to do short, intensive calculations once a week, once a month, or once a quarter. It’s not much work to create a full Hadoop cluster using the other tools in this review, but it’s kind of silly to set up new machines from scratch every time the need arises. Amazon offers a nice shortcut: upload a Python script or a JAR file and go straight to computation.
Cloudera CDH, Manager, and Enterprise
Cloudera is a startup that has collected Hadoop experts from the major companies using it. The CTO came from Yahoo, the chief scientist from Facebook, and the CEO from Oracle. The staff is filled with people who learned Hadoop by building it.
The company is selling training, support, professional services, and some tools for managing your cluster. The Cloudera distribution and basic manager are free for clusters with fewer than 50 machines, while the subscription-based enterprise edition offers many more features for handling standard data formats.
The free version is quite useful for starting up a cluster and monitoring the jobs as they flow through the system. The manager takes a list of IP addresses, logs into all of them with SSH, and installs the major tools.
The automation makes it pretty easy to run the Cloudera distro, but I still had to patch a few glitches to install it on CentOS. One component wanted a particular version of zip, and the installation ground to a halt until I logged in and straightened things out by hand.
Hortonworks Data Platform
I wanted to test the Hortonworks distribution, but it wasn’t ready at the time of this writing. The company plans to concentrate on selling training and support while steering clear of proprietary extensions.
“We are an open source company,” Eric Baldeschwieler, the CEO, told me. “The only product we have is open source. We won’t commit to never selling anything, but you won’t see anything in the next year. We’re committed to a complete open source, horizontal platform. We want people to be able to download everything they want for free. That differentiates us from everyone else in the market.”
Indeed, the company employs a number of people with deep knowledge of Hadoop gained from years at Yahoo. The company formally separated from Yahoo last year, and now it’s looking for partnerships to support its work.
Hortonworks is currently running a private beta. I couldn’t join it, but perhaps your company will be able to participate. In the meantime, you can grab Hadoop directly from Apache. It’s guaranteed to be pretty close to what Hortonworks will be shipping, at least for the next year.
Choosing a Hadoop
There’s no easy way to summarize the quickly shifting space. Each of these companies is pointed in a slightly different direction. They may all agree that the Hadoop collection of software is a great way to spread out work over a cluster, but they each have different visions of who would want to do this and, more important, how to accomplish it. The similarities are fewer than you might expect.
The biggest differences may be in how you handle your data. The idea of making your data accessible through NFS may be one of the neatest innovations, but MapR is introducing some risk by breaking from the pack and adding its own proprietary extensions. MapR’s claims for great speed and better throughput are tantalizing, but there’s also the danger of bugs or mistakes appearing because of incompatibility. Just as in horror movies, bad things can happen when you split up and strike off on your own.
Amazon’s system also imposes its own limitations. It’s easiest if you’ve already decided to park your information in S3. If that’s already the natural home for your data, you won’t notice the constraint; if it isn’t, you’ll have to adapt.
Much will also depend on how you’re using your data. I think Amazon’s cloud may be the simplest way to knock off fast jobs that are run occasionally, but it’s not the only choice. Both IBM and Cloudera make it relatively easy to set up and run a cluster. After doing it a few times, I found I could knit together a small cluster of Rackspace cloud machines in just a few minutes. It’s not simple, but it’s not too hard either.
My guess is that most folks will want to use Cloudera, IBM, or MapR for permanent clusters in permanent clouds. Although it’s tempting to spin up a rack for a bit of work, it probably makes more sense to leave it up and running just to simplify the process of migrating the data. I suspect that most Hadoop work involves more data juggling than raw computation. Leaving the cluster up and running makes it possible to move the data in and out.
Another big difference is how the companies are adding features for processing different types of data. IBM includes Lucene, a text search engine for building indices. Cloudera offers intelligent log search. I think that these sorts of Hadoop add-ons will only become more common.
These additions will also put more stress on the open source foundation of Hadoop. In many ways, the proliferation of different approaches shows the strength of the open source approach. The commercial vendors can collaborate on the core while competing on which extra features to add to the mix.
Still, there’s bound to be some tension between the companies as they add material. I’m hopeful that the spirit of cooperation will remain strong enough to keep everyone working together, but there’s no reason to assume it always will be. When people get good ideas, they’ll roll them into their own distributions first, and they may or may not contribute them back to the core project.