I spoke to Daniel Abadi this morning about his HadoopDB announcement that came out a couple of days back. This has been a busy time for Daniel and his team at Yale, as HadoopDB has been attracting a lot of interest that I am sure will continue to build.
Some notes from our discussion:
- HadoopDB is primarily focused on high scalability and the availability required at that scale. Daniel questions current MPP databases' ability to truly scale past 100 nodes, whereas Hadoop has real-world deployments on 3,000+ nodes.
- HadoopDB, like many MPP analytical database platforms, uses shared-nothing relational databases as its processing units; in HadoopDB's case, PostgreSQL. Unlike other MPP databases, HadoopDB uses Hadoop as the distribution mechanism.
- I am ad-libbing here, but I understand that Daniel doesn't dispute the DeWitt and Stonebraker paper (which he co-authored) claiming that MapReduce underperforms current MPP DBMSs. HadoopDB, however, is focused on massive scale: hundreds or thousands of nodes. The largest MPP database deployment we currently know of is 96 nodes.
- Early benchmarking shows HadoopDB outperforms Hadoop but is slower than current MPP databases under normal circumstances. However, when simulating node failure mid-query, HadoopDB significantly outperformed current MPP databases.
- The greater the scale, the higher the chance of a node failing mid-query. Very large Hadoop deployments may experience at least one node failure per query (job).
- HadoopDB is usable today, but should not be considered an "out of the box" solution. HadoopDB is the outcome of a database research initiative, not a commercial venture. Anyone planning to use HadoopDB will need the appropriate systems and development skills to deploy it effectively.
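To make the architectural split concrete, here is a minimal sketch of the idea of local relational processing units combined by a MapReduce-style layer. This is not HadoopDB's actual API: sqlite3 stands in for the per-node PostgreSQL instances, plain Python stands in for Hadoop's shuffle and reduce, and the table and column names are invented for illustration.

```python
import sqlite3

def make_node(rows):
    """One 'node': a local database holding a partition of the sales table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return db

# Three partitions of the data, one per simulated node.
nodes = [
    make_node([("east", 10.0), ("west", 20.0)]),
    make_node([("east", 30.0), ("west", 5.0)]),
    make_node([("east", 2.0)]),
]

# "Map" phase: push the partial aggregate down into each node's SQL engine,
# so the relational work happens locally, as in HadoopDB's design.
partials = []
for db in nodes:
    partials.extend(
        db.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    )

# "Reduce" phase: merge the partial sums by key, as Hadoop would
# after the shuffle.
totals = {}
for region, subtotal in partials:
    totals[region] = totals.get(region, 0.0) + subtotal

print(totals)  # {'east': 42.0, 'west': 25.0}
```

The point of the split is that the expensive per-partition work runs inside a mature SQL engine, while only small partial results cross the network to be combined.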
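The point about node failures at scale can be made with back-of-envelope arithmetic: if each node independently has probability p of failing during a given query, the chance that at least one of n nodes fails is 1 - (1 - p)^n, which climbs quickly with n. The per-node probability of 0.001 below is an invented illustrative figure, not a measured one.

```python
def p_any_failure(n_nodes, p_node=0.001):
    # Probability that at least one of n independent nodes fails mid-query,
    # given an assumed (illustrative) per-node failure probability p_node.
    return 1 - (1 - p_node) ** n_nodes

for n in (10, 100, 1000, 3000):
    print(f"{n:>5} nodes: {p_any_failure(n):.1%} chance of a mid-query failure")
```

At 1,000 nodes the chance of some node failing mid-query already exceeds 60% under this assumption, which is why mid-query fault tolerance, rather than restarting the whole query, becomes the deciding factor at Hadoop-like scale.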
HadoopDB is an innovative approach to the scalability challenges that continue to push modern database architecture forward.