Building an online system with Hadoop

Joe Halliwell // @joehalliwell
Winterwell Associates Ltd

Project X: a new kind of ad server

  • Built on Winterwell's MagicMatch algorithm
  • Compute: Need to compare 100M x 100M records and do it often
  • Store: lots of fat "enhanced" records, plus lots of skinny transactions
  • Bonus requirements: fast, highly concurrent, highly available

Hadoop: the elephant in the room

  • Open-source implementation of Google infrastructure, ca. 2006
  • Designed to index a large and growing web by scaling horizontally on commodity hardware
  • A family of tools
    • HDFS: Scalable storage infrastructure
    • MapReduce: Scalable compute infrastructure
    • And: ZooKeeper, HBase, Cassandra, Chukwa, Pig, Hive, Mahout...

A confession

Lesson 1: It's not rocket science

  • It's just good old-fashioned file-oriented batch processing
    • ...at massive scale
    • ...in any language you like
    • ...sensitive to data locality
    • ...battle-tested and fault tolerant
    • ...with built-in monitoring and admin tools
  • Awesome!
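
To make the "file-oriented batch processing" point concrete, here is a minimal sketch of the map, shuffle, reduce flow, run locally in plain Python. It is not Hadoop code: `mapper`, `reducer` and `run_job` are hypothetical names, the word-count task is only an illustration, and an in-memory sort stands in for the shuffle the framework would do across machines.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit one (word, 1) pair per word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    """Reduce step: sum the counts per key; expects pairs sorted by key."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


def run_job(lines):
    """Local stand-in for the framework: map, sort (the shuffle), reduce."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    # Two "input files" worth of lines, made up for the example.
    print(run_job(["ad click ad", "click ad"]))
```

The mapper and reducer only ever see a stream of records, which is why the same pattern can be written in almost any language.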

Lesson 2: It is immature

  • Surprising limits on functionality
  • Poorly organised documentation
  • /tmp is not a good location for persistent data
  • HBase clients can easily/accidentally crash the server
  • Single points of failure

Lesson 3: You can't always outrun Big O

  • Version 0.1 averaged 200k cmp/sec. Yay!
  • 100M records + N² algorithm = ???
  • 1,500 years
  • Horizontal scaling isn't going to help
  • Bugger
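
The arithmetic behind that estimate can be checked with a few lines of Python. This is a back-of-the-envelope sketch: it assumes every record is compared with every other record at the Version 0.1 rate, and the 1,000-node figure is a hypothetical cluster size with perfect linear scaling, not a number from the project.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_for_all_pairs(n_left, n_right, comparisons_per_sec):
    """Years needed to compare every left record with every right record."""
    total_comparisons = n_left * n_right
    return total_comparisons / comparisons_per_sec / SECONDS_PER_YEAR


if __name__ == "__main__":
    # 100M x 100M comparisons at 200k comparisons per second.
    single = years_for_all_pairs(100_000_000, 100_000_000, 200_000)
    print(f"one worker: {single:,.0f} years")

    # Hypothetical: 1,000 workers with perfect linear speed-up.
    cluster = years_for_all_pairs(100_000_000, 100_000_000, 200_000 * 1_000)
    print(f"1,000 workers: {cluster:.2f} years")
```

Even under that generous scaling assumption a single pass takes more than a year, and the comparison has to run often. The fix has to come from the algorithm, not from more machines.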

Lesson 4: Unlimited storage is liberating

  • No need for backups (in principle)
  • Keep everything
  • Version everything
  • Design for experimentation

Lesson 5: Don't overlook ZooKeeper

  • Ships with HBase
  • Minimalistic, sequentially consistent, persisted, in-memory database
  • Distributed locks and counters
  • Leadership election
  • Easy to use (via Curator)
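
The idea behind leadership election can be sketched without a cluster. In the standard ZooKeeper recipe each candidate creates an ephemeral sequential node, and the candidate holding the lowest sequence number leads; when its session ends, its node vanishes and the next candidate takes over. The toy model below simulates that in memory. It is not a ZooKeeper or Curator client, and `ElectionSim` and its methods are hypothetical names invented for this illustration.

```python
import itertools


class ElectionSim:
    """Toy in-memory model of the election recipe; not a ZooKeeper client."""

    def __init__(self):
        self._seq = itertools.count()  # stands in for ZooKeeper's sequence counter
        self._nodes = {}               # sequence number -> candidate name

    def join(self, name):
        """Candidate registers, as if creating an ephemeral sequential node."""
        seq = next(self._seq)
        self._nodes[seq] = name
        return seq

    def leave(self, seq):
        """Session ends: the candidate's ephemeral node disappears."""
        self._nodes.pop(seq, None)

    def leader(self):
        """The candidate holding the lowest sequence number leads."""
        if not self._nodes:
            return None
        return self._nodes[min(self._nodes)]


if __name__ == "__main__":
    sim = ElectionSim()
    a = sim.join("server-a")
    sim.join("server-b")
    print(sim.leader())  # server-a joined first, so it leads
    sim.leave(a)         # server-a's session ends
    print(sim.leader())  # leadership passes to server-b
```

In a real deployment the sequence numbers, the ephemeral nodes and the failure detection all come from ZooKeeper itself, and Curator packages the recipe so that application code rarely has to implement it by hand.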

Lesson 6: Deployment is complicated

  • Ensure you have a good sysadmin
  • Work closely with her
  • Beware log files!
  • Build a heterogeneous cluster
  • Consider Hadoop 2.0

Wrap-up

  • Can you build online systems with Hadoop? Yes!
  • Can you achieve good performance and high availability? Yes!
  • Project X in production and serving ads
  • What's next? More versioning; more analytics; Hadoop 2 port

Thanks! Any questions?

Follow @joehalliwell on Twitter!