January 10, 2010
HadoopHackDay was a major hit
Last weekend we had a Hadoop HackDay at the SlideShare office. Ten teams competed in all: 3 from outside SlideShare and 7 from SlideShare. Nobody slept a wink (OK, I may have dozed off for an hour or so, but the contestants kept cranking out code through the night). The hacks were uniformly impressive: recommender systems, personalization engines, classification systems, and operations monitoring / analytics were the most common themes.
Here are the top things I learned:
-Hack days rule as a way of learning new things. In a day, we were able to go from not knowing much about Hadoop to being able to build real systems with it. The competitive motivation + the ability to learn from each other really accelerated individual learning to an almost unbelievable extent. Everyone who competed produced working code that did something at least moderately impressive, even though only one team had any previous Hadoop experience!
-Infrastructure and Architecture are coming closer together, in that software engineers need to understand a LOT more about the core infrastructure in order to do their jobs. At times the hackday seemed as much an #awshackday as a #hadoophackday. Learning how to use Elastic MapReduce, EC2, S3, and Elastic Block Store was as central to the experience as learning how to code in Pig and Hive.
-Pig and Hive are very powerful languages for scripting Hadoop jobs. The final programs submitted by participants were often less than 100 lines long, yet performed very powerful transformations on large data sets. The learning curve was manageable, certainly much gentler than learning a new high-powered language like Python or Ruby.
-Elastic MapReduce (from Amazon) is the ultimate gateway drug for parallel computing. Every participant was able to start running simple Hadoop programs in less than an hour, without installing anything on their laptops! However, the versions of Hadoop and Pig that come with it are quite old, and given the number of nodes one will need in production, it will be much cheaper to run the Cloudera distribution of Hadoop on EC2 machines that you rent on the Amazon spot market. For experimentation, it's hard to argue against Elastic MapReduce. The hosting bill for the entire hackday came to $22!
-Hadoop is very resource-intensive! We started out using 1-node clusters to run our jobs against small subsets of data. Very quickly teams started upgrading to 5-node clusters because of how long they were having to wait for results. Final runs against full data sets were powered by 10-node clusters of "medium" EC2 servers. You have no choice but to use cloud computing for these kinds of jobs: it seems to me that production use could easily require 100s of nodes, and no one would want to buy that many servers only to use them one hour a day.
-Moving data to the compute cluster was more of a limiting factor than we had anticipated. Most people wanted to work on BIG (at least 1GB) data sets, and copying that from the slideshare cluster to S3, then from S3 to the Hadoop file system, took a lot of time. If this is a limiting factor for your app, you'll need to host your whole app in the cloud, or use a physical hosting provider who also provides cloud computing services (like softlayer or rackspace).
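To put rough numbers on the point above about clusters that only run an hour a day, here is a minimal Python sketch. The 100-node cluster size and the function names are illustrative assumptions, not measurements from the hackday.

```python
HOURS_PER_DAY = 24


def owned_utilization(hours_used_per_day: float) -> float:
    """Fraction of each day that purchased servers would actually be busy."""
    return hours_used_per_day / HOURS_PER_DAY


def daily_node_hours(nodes: int, hours_used_per_day: float) -> float:
    """Node-hours consumed per day, which is what pay-as-you-go bills for."""
    return nodes * hours_used_per_day


nodes = 100           # hypothetical production cluster size
hours_per_day = 1.0   # the "one hour a day" usage pattern

used = daily_node_hours(nodes, hours_per_day)
capacity = nodes * HOURS_PER_DAY  # node-hours you would own either way
print(f"used {used:.0f} of {capacity} node-hours per day "
      f"({owned_utilization(hours_per_day):.1%} utilization)")
```

Under these assumptions, owned hardware would sit idle for all but about 4% of the day, which is the case for renting the cluster by the hour.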
That's it! We can't wait to start diving into using hadoop in production, and we'll probably organize more public hackdays in the next few months, since this one was so successful.
December 03, 2009
Why Amazon Reserved Instances don't make economic sense for startups
At slideshare we spend a LOT of money on Amazon Web Services, especially EC2. We love AWS because the pay-as-you-go pricing model means that we never invest in servers that we aren't ready to use yet. But earlier this year, Amazon released the ability to prepay for a "reserved EC2 instance" (see my initial reaction here). In exchange for paying a fee, you get the right to consume instance-hours at 1/3 of the standard rate. You can do this for either one or three years (three years simply requires a larger fee). The pricing of both reserved instances and standard instances has recently been discounted.
I was curious whether this would be a good deal for slideshare, so I modeled out the cost over a three-year period for one large instance. I modeled 3 scenarios:
1) paying as you go (the way we do currently)
2) paying the 1-year reserved instance fee every year
3) paying the 3-year reserved instance fee
The discounting is identical across different instance types (I spent a little time double-checking this), so my conclusions should be relevant to you even if you use small or medium instances.
Here's the spreadsheet.
The results surprised me quite a bit. Some quick observations:
1) Amazon bills at the end of the month (after usage). But the prepay happens at the beginning of the month (before usage). This pushes out the "break-even" point for an investment in a reserved instance 1 month further than you might think. For the 1-year plan (the only one worth considering IMHO) this happens in the seventh month.
2) Discounts are not as generous as they appear. As a result, it ONLY makes sense to consider a reserved instance for a machine that will be running 24 hours a day, 7 days a week.
3) Amazon pricing is rapidly being discounted. Locking your prices in for 3 years is almost certainly not beneficial to you at this point, given the small spread between the discounts. The difference is 18%, and given Amazon's track record, betting that they will not discount their services by 18% in the next three years is very risky.
4) A 30% a year discount (which is what you get with the one-year prepay option I model) will certainly be attractive to many small businesses or larger companies. After all, a 30% yearly return on an investment is pretty good. But a startup will almost always have something else it can invest in that will pay better than 30%/year. For us it's engineering: the faster we can improve the slideshare experience, the more money comes in the front door for us. The cash flow properties of amazon's core pricing model (paying for the infrastructure you need after you use it) are pretty darned hard to beat.
Conclusion: 1-year instances may be a good choice for many customers. But most venture-backed, bootstrapped, or rapidly growing companies should just stick with the default Amazon pricing. So we won't be investing in Amazon Reserved Instances right now. We'll just rely on the steady discounting from Amazon to drive our infrastructure costs down over time.
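The break-even timing in observation 1 can be sketched in a few lines of Python. This is a simplified model, not the spreadsheet itself: as illustrative inputs it uses the small-instance prices quoted in the March 2009 post further down this page ($.10/hr standard, $325 fee plus $.03/hr reserved), not the discounted large-instance prices I modeled, and the function names are my own.

```python
HOURS_PER_MONTH = 24 * 365 / 12  # 730 hours in an average month


def cumulative_on_demand(month, hourly_rate):
    """Cash paid by the end of `month` when each month is billed after usage."""
    return month * hourly_rate * HOURS_PER_MONTH


def cumulative_reserved(month, upfront_fee, reserved_hourly_rate):
    """Cash paid by the end of `month`: fee before month 1, usage billed after."""
    return upfront_fee + month * reserved_hourly_rate * HOURS_PER_MONTH


def break_even_month(hourly_rate, upfront_fee, reserved_hourly_rate, horizon=12):
    """First month whose closing total favors the reserved instance, else None."""
    for month in range(1, horizon + 1):
        reserved = cumulative_reserved(month, upfront_fee, reserved_hourly_rate)
        on_demand = cumulative_on_demand(month, hourly_rate)
        if reserved <= on_demand:
            return month
    return None


# Illustrative inputs only: March 2009 small-instance prices.
print(break_even_month(hourly_rate=0.10, upfront_fee=325, reserved_hourly_rate=0.03))
```

With these inputs the prepaid fee is not recovered until the seventh monthly bill, and that is for a machine running 24/7. A machine that runs fewer hours takes longer to break even, or never does within the year.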
March 12, 2009
EC2 Reserved Instances: are they a good deal?
Amazon today announced a plan that makes EC2 boxes a bit cheaper to rent for customers who use the box 24/7. It's called a "Reserved Instance", and it basically means you pay a certain amount up front in exchange for a large discount for either one or three years.
Is this a good deal? As always, it depends. Let's assume that you actually use a certain base number of servers from amazon 24/7 (like we do at slideshare). Ignoring bandwidth costs, a small instance costs $.10/hr * 24 hrs * 365 days = $876/year (or $73/month).
With the one-year plan you'd pay $325 up front, plus ($.03/hr * 24 hrs * 365 days), which adds up to $325 + $263 = $588, or a 33% savings. Your monthly cost ends up at $49/month.
Things get better on the three-year plan. Here you pay $500 in exchange for the right to the $.03/hr pricing for three years. Your total cost ends up being $500 + ($.03/hr * 24 hrs * 365 days * 3 years), which is $1288 for three years, a 51% savings over the $2628 you would pay at the standard rate. Your monthly cost comes down to about $36/month.
So this seems like a good deal, but there are some caveats. You have to pick what size instance you are going to prepay for: if you prepay for a small and it turns out you need a medium, there is no recourse. Also, you have to pay money up front, which is definitely a negative. One of the great properties of AWS is the "pay by the drink" model, which lets you pay for services AFTER you use them rather than before, and this is obviously great for your cash flow. Finally, reserved instances are not available for Windows servers yet, only for Linux ones.
A 51% discount is nothing to sneeze at, so if you're sure you're going to be using a particular machine type 24/7, it makes sense to take advantage of this program. A smart way to do it might be to move one machine over, and then pay for subsequent reserved instances over time with the savings. This way you can avoid committing too much money up front (which is never a good idea, especially in a recession).
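The arithmetic above can be reproduced with a short Python sketch. The prices are the small-instance figures quoted in this post; the function names are my own, bandwidth is ignored, and exact results may round a point differently from the figures in the text.

```python
HOURS_PER_YEAR = 24 * 365  # a machine that runs 24/7


def on_demand_cost(hourly_rate, years):
    """Total cost at the standard hourly rate."""
    return hourly_rate * HOURS_PER_YEAR * years


def reserved_cost(upfront_fee, reserved_hourly_rate, years):
    """Up-front fee plus usage at the discounted hourly rate."""
    return upfront_fee + reserved_hourly_rate * HOURS_PER_YEAR * years


def savings_fraction(reserved, on_demand):
    """Fraction saved by reserving instead of paying as you go."""
    return 1 - reserved / on_demand


plans = [("1-year", 325, 1), ("3-year", 500, 3)]  # (label, fee in $, term in years)
for label, fee, years in plans:
    baseline = on_demand_cost(0.10, years)
    total = reserved_cost(fee, 0.03, years)
    print(f"{label}: ${total:,.2f} reserved vs ${baseline:,.2f} standard, "
          f"{savings_fraction(total, baseline):.1%} saved, "
          f"${total / (12 * years):.2f}/month")
```

Changing `HOURS_PER_YEAR` to something below 24/7 usage shows how quickly the savings shrink for machines that are not always on.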
November 13, 2006
Web site monitoring service recommendations
Can anyone recommend a good website monitoring service (doesn't have to be free)?
I need sms and email alerts, the ability to send alerts when response time goes up or when a page contains particular text, and a minimum of false positives.
So far I've tested the following freebies and found them lacking in one way or another:
site247 (false positives: otherwise would have seemed the best option)
Mon.itor.us (tests only once a day. Very confusing interface)
Montastic (way too basic)
Next up for evaluation are the following paid services: if anyone has any experience with these, or has other services they think I should try, post a comment below!
siteuptime
websitepulse
alertra
doc-com monitor
internetseer
hyperspin
webmetrics
hosttracker
siterecon
watchmouse
11/15 Correction: mon.itor.us actually tests much more often than once a day.
August 24, 2006
Utility Computing is here: meet the Amazon Elastic Compute Cloud
Amazon has been pushing the limits of distributed computing, offering very useful, reasonably priced computing services like their awesome online storage service (S3) and their queuing service (SQS). Now they’ve released something MUCH more generic and powerful: a hosting infrastructure that lets you preconfigure your desired servers (by giving Amazon a disk image of a Linux machine). It's called the Elastic Compute Cloud, or EC2 for short. When you want a server, you can then order it via the website and have it online within minutes. Pricing is a very reasonable 10 cents an hour ($72/month) plus bandwidth. Each instance provides the computing equivalent of a dedicated system with a 1.7GHz Xeon CPU, 1.75GB of RAM, 160GB of local disk, and 250Mb/s of network bandwidth.

October 21, 2005
ZDNet snafu: top 25 on-demand providers
Yesterday, I found an article on zdnet called "Top 25 on-demand providers". The article did not live up to its title: in fact, it was remarkably content-free! A little digging turned up the backstory: after ZDNet published the story, the analysts that created the list asked ZDNet to remove the content.
Now ZDNet really shouldn't be publishing content that they don't have a license to. But they shouldn't edit stories beyond recognition, either. If an article's main content must be removed, the best thing to do would be to remove the article entirely, not castrate it. Thankfully, for those interested in the on-demand software space, there's a google cache still available with the complete list. The list is also below. Enjoy! [via ken novak]
October 20, 2005
Megatrend alert: Rich Clients, Web Services, and On Demand Software
The major trends in IT today reinforce each other in a powerful way. The two technology trends (Web Services and Rich Clients) are tailor-made for the new business-model trend (On Demand Software). The two technology trends also reinforce each other, creating a self-reinforcing web of interactions that will accelerate once it gains momentum, and may not stop until it has absorbed most of the software world as we know it!
