Project X: a new kind of ad server
- Built on Winterwell's MagicMatch algorithm
- Compute: Need to compare 100M x 100M records and do it often
- Store: lots of fat "enhanced" records, plus lots of skinny transactions
- Bonus requirements: fast, highly concurrent, highly available
Hadoop: the elephant in the room
- Open-source implementation of Google infrastructure ca 2006
- Designed to index a large and growing web by scaling horizontally on commodity hardware
- A family of tools
- HDFS: Scalable storage infrastructure
- MapReduce: Scalable compute infrastructure
- And: ZooKeeper, HBase, Cassandra, Chukwa, Pig, Hive, Mahout...
Lesson 1: It's not rocket science
- It's just good old-fashioned file-oriented batch processing
- ...at massive scale
- ...in any language you like
- ...sensitive to data locality
- ...battle-tested and fault tolerant
- ...with built-in monitoring and admin tools
- Awesome!
Lesson 2: It is immature
- Surprising limits on functionality
- Poorly organised documentation
/tmp
is not a good location for persistent data
- HBase clients can easily/accidentally crash the server
- Single points of failure
Lesson 3: You can't always outrun Big O
- Version 0.1 averaged 200k cmp/sec. Yay!
- 100M records + N2 algorithm = ???
- 1,500 years
- Horizontal scaling isn't going to help
- Bugger
Lesson 4: Unlimited storage is liberating
- No need for backups (in principle)
- Keep everything
- Version everything
- Design for experimentation
Lesson 5: Don't overlook ZooKeeper
- Ships with HBase
- Minimalistic, eventually consistent, persisted, in-memory database
- Distributed locks and counters
- Leadership election
- Easy to use (via Curator)
Lesson 6: Deployment is complicated
- Ensure you have a good sysadmin
- Work closely with her
- Beware log files!
- Build a hetrogenous cluster
- Consider Hadoop 2.0
Wrap-up
- Can you build on-line systems with Hadoop? Yes!
- Can you achieve good performance and high-availability? Yes!
- Project X in production and serving ads
- What's next? More versioning; more analytics; Hadoop 2 port
Thanks! Any questions?
Follow @joehalliwell on Twitter!