Building an online system with Hadoop

Joe Halliwell // @joehalliwell
Winterwell Associates Ltd

Project X: a new kind of ad server

Built on Winterwell's MagicMatch algorithm
Compute: Need to compare 100M x 100M records and do it often
Store: lots of fat "enhanced" records, plus lots of skinny transactions
Bonus requirements: fast, highly concurrent, highly available

Hadoop: the elephant in the room

Open-source implementation of Google infrastructure ca. 2006
Designed to index a large and growing web by scaling horizontally on commodity hardware
A family of tools
- HDFS: Scalable storage infrastructure
- MapReduce: Scalable compute infrastructure
- And: ZooKeeper, HBase, Cassandra, Chukwa, Pig, Hive, Mahout...

A confession

Lesson 1: It's not rocket science

It's just good old-fashioned file-oriented batch processing

...at massive scale
...in any language you like
...sensitive to data locality
...battle-tested and fault tolerant
...with built-in monitoring and admin tools

Awesome!
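
To make the "file-oriented batch processing, in any language you like" point concrete, here is a minimal sketch of a map step and a reduce step in Python, in the style Hadoop Streaming expects (tab-separated key/value lines, with the reducer seeing its input sorted by key). The record layout (comma-separated, ad id in the first field) and the names mapper, reducer and run_local are hypothetical, chosen for illustration; they are not Project X's actual jobs. run_local stands in for the framework's sort-and-shuffle so the sketch can run on a laptop.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit one tab-separated (key, 1) record per input line.

    Assumes a hypothetical CSV layout whose first field is the key.
    """
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if fields[0]:  # skip blank lines and records with an empty key
            yield f"{fields[0]}\t1"


def reducer(sorted_records):
    """Reduce step: sum the counts per key. Input must be sorted by key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["ad1,click", "ad2,view", "ad1,view"]
    for record in run_local(sample):
        print(record)
```

On a real cluster the mapper and reducer would read from standard input and write to standard output, and the framework would handle splitting, sorting and retries; the logic itself stays this small.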

Lesson 2: It is immature

Surprising limits on functionality
Poorly organised documentation
/tmp is not a good location for persistent data
HBase clients can easily/accidentally crash the server
Single points of failure

Lesson 3: You can't always outrun Big O

Version 0.1 averaged 200k cmp/sec. Yay!
100M records + N² algorithm = ???
1,500 years
Horizontal scaling isn't going to help
Bugger
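
The arithmetic behind that figure can be checked with a short sketch. It assumes every record is compared against every other (N² comparisons) at the quoted 200k cmp/sec; the workers parameter and the assumption that throughput scales perfectly linearly are illustrative, not measured.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def all_pairs_years(records, cmp_per_sec, workers=1):
    """Years needed to compare every record with every other record.

    Assumes N*N comparisons and perfectly linear scaling across workers,
    which is optimistic for any real cluster.
    """
    comparisons = records * records
    seconds = comparisons / (cmp_per_sec * workers)
    return seconds / SECONDS_PER_YEAR


single = all_pairs_years(100_000_000, 200_000)
scaled = all_pairs_years(100_000_000, 200_000, workers=1_000)

print(f"{single:,.0f} years at 200k cmp/sec")
print(f"{scaled:.1f} years with 1,000x the throughput")
```

This comes out at roughly 1,585 years, in line with the figure above. Even a thousandfold speed-up still leaves a single pass taking more than a year, which is why the algorithm has to change rather than the cluster size.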

Lesson 4: Unlimited storage is liberating

No need for backups (in principle)
Keep everything
Version everything
Design for experimentation

Lesson 5: Don't overlook ZooKeeper

Ships with HBase
Minimalistic, sequentially consistent, persisted, in-memory database
Distributed locks and counters
Leadership election
Easy to use (via Curator)

Lesson 6: Deployment is complicated

Ensure you have a good sysadmin
Work closely with her
Beware log files!
Build a heterogeneous cluster
Consider Hadoop 2.0

Wrap-up

Can you build online systems with Hadoop? Yes!
Can you achieve good performance and high availability? Yes!
Project X in production and serving ads
What's next? More versioning; more analytics; Hadoop 2 port

Thanks! Any questions?

Follow @joehalliwell on Twitter!