HadoopHackDay was a major hit
Last weekend we had a Hadoop HackDay at the SlideShare office. Ten teams competed in all: 3 from outside SlideShare and 7 from SlideShare. Nobody slept a wink (OK, I may have dozed off for an hour or so, but the contestants kept cranking out code through the night). The hacks were uniformly impressive: recommender systems, personalization engines, classification systems, and operations monitoring / analytics were the most common themes.
Here are the top things I learned:
-Hack days rule as a way of learning new things. In a day, we were able to go from not knowing much about Hadoop to being able to build real systems with it. The competitive motivation + the ability to learn from each other really accelerated individual learning to an almost unbelievable extent. Everyone who competed produced working code that did something at least moderately impressive, even though only one team had any previous Hadoop experience!
-Infrastructure and architecture are coming closer together, in that software engineers need to understand a LOT more about the core infrastructure in order to do their jobs. At times the hackday seemed as much an #awshackday as a #hadoophackday. Learning how to use Elastic MapReduce, EC2, S3, and Elastic Block Store was as central to the experience as learning how to code in Pig and Hive.
-Pig and Hive are very powerful languages for scripting Hadoop jobs. The final programs submitted by participants were often less than 100 lines long, yet performed very powerful transformations on large data sets. The learning curve was manageable, certainly much gentler than that of a new high-powered language like Python or Ruby.
-Elastic MapReduce (from Amazon) is the ultimate gateway drug for parallel computing. Every participant was able to start running simple Hadoop programs in less than an hour, without installing anything on their laptops! However, the versions of Hadoop and Pig that come with it are quite old, and given the number of nodes one will need in production, it will be much cheaper to run the Cloudera distribution of Hadoop on EC2 machines that you rent on the Amazon spot market. For experimentation, it's hard to argue against Elastic MapReduce. The hosting bill for the entire hackday came to $22!
-Hadoop is very resource-intensive! We started out using 1-node clusters to run our jobs against small subsets of data. Very quickly teams started upgrading to 5-node clusters because of how long they were having to wait for results. Final runs against full data sets were powered by 10-node clusters of "medium" EC2 servers. You have no choice but to use cloud computing for these kinds of jobs, because it seems to me that production use could easily require hundreds of nodes, and no one would want to buy that many servers only to use them one hour a day.
-Moving data to the compute cluster was more of a limiting factor than we had anticipated. Most people wanted to work on BIG (at least 1GB) data sets, and copying that from the SlideShare cluster to S3, then from S3 to the Hadoop file system, took a lot of time. If this is a limiting factor for your app, you'll need to host your whole app in the cloud, or use a physical hosting provider who also provides cloud computing services (like SoftLayer or Rackspace).
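For readers who haven't touched Hadoop yet, here is a rough idea of the kind of work a short Pig or Hive script hands off to the cluster. This is only a minimal single-machine sketch in Python, not code from the hackday: the tab-separated "user, slideshow" log format, the sample records, and all function names are made up for illustration. It counts views per slideshow using the map, shuffle, and reduce steps that Hadoop would spread across many nodes.

```python
from collections import defaultdict

# Hypothetical input: one "user_id<TAB>slideshow_id" view record per line.
# The format and values are assumptions for illustration only.
SAMPLE_LOG = [
    "u1\ts10",
    "u2\ts10",
    "u1\ts11",
    "u3\ts10",
    "malformed line",
]


def map_phase(lines):
    """Emit a (slideshow_id, 1) pair for every well-formed record."""
    for line in lines:
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip records that do not have exactly two fields
        _user, slideshow = parts
        yield slideshow, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Sum the grouped values to get one total per key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_views(lines):
    """Run the three phases in sequence on a single machine."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    for slideshow, views in sorted(count_views(SAMPLE_LOG).items()):
        print(f"{slideshow}\t{views}")
```

In Pig or Hive the same grouping and counting takes only a few lines of script, which goes some way toward explaining why the submitted programs could stay under 100 lines.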
That's it! We can't wait to start using Hadoop in production, and we'll probably organize more public hackdays in the next few months, since this one was so successful.