Last week I spent some time speaking with Kevin Weil, head of analytics at Twitter. Twitter, from a technology perspective, has had a bit of a hard time due to their stability issues in their early days. Kevin was keen to point out that he feels this was due to the incomparable growth Twitter was experiencing at the time and their constant struggle to keep up. Kevin was also keen to show that Twitter prides themselves on striving for engineering excellence, the creation & contribution to new technologies and generally assisting in pushing the boundaries forward. Our conversation naturally centered on analytics at Twitter.
Twitter, like many web 2.0 apps, started life as a MySQL based RBDMS application. Today, Twitter is still using MySQL for much of their online operational functionality (although this is likely to change in the near future – think distributed), but on the analytics side of things Twitter has spent the last 6 months moving away from running SQL queries against MySQL data marts. This was because their need for timely data was becoming a struggle with MySQL, particularly when dealing with very large data volumes and complicated queries. For Web 2.0 the ability to understand, quantify and make timely predictions from user behavior is very much their life blood. When Kevin arrived at Twitter 6 months ago he was tasked with changing the way Twitter analyzed their data. Now the bulk of their analytics is executed using a Hadoop platform with Pig as the “querying language”.
Hadoop is a distributed shared-nothing cluster which locates data throughout the cluster using a virtualized file system. What has made Hadoop particularly popular for large scale deployment is the comparative ease of writing distributed functions through a process known as map/reduce. Map/reduce hides much of the complexity of running distributed functions, even when running over a very large numbers of nodes. This allows the developer to focus on their “application logic” rather than worrying about specifics of the execution process (Hadoop handles distribution of execution, node failures, etc). But in saying this, expressing complicated application logic directly in map/reduce functions can become quite laborious as many pipelined map/reduce functions may be required to take raw data through to a useful processed result. Because of this complexity several higher level scripting languages have appeared to abstract this.
Pig is one such scripting language for Hadoop. Pig takes the developers requirement expressed in the script and produces the underlying map-reduce jobs that are executed on Hadoop. This abstraction is incredibly important as without it the complexity of expressing difficult analytical ‘queries’ directly in map/reduce would be highly time consuming & error prone. This can be thought of as being similar to the way SQL is a higher level abstraction language that hides all the query plan routines (written in C) that operate on the data in a traditional RDBMS. Of course abstraction provides increased efficiency in creating analytical routines, but comes at a performance cost. Kevin quantified his experience, he found typically a Pig script is 5% of the code of native map/reduce written in about 5% of the time. However, queries typically take between 110-150% the time to execute that a native map/reduce job would have taken. But of course, if there is a routine that is highly performance sensitive they still have the option to hand-code the native map/reduce functions directly.
Ok, so why use Hadoop and Pig instead of more traditional approach like an MPP RDBMS? Kevin explained that there were a few reasons for this. Firstly Twitter, like many Web 2.0 companies, is committed to open source and likes to use software that has a low entry cost but also allows them to contribute to the code base. Kevin mentioned that Twitter did look at some of the open source MPP RDBMS platforms but were less than convinced of their ability to scale to meet their needs at the time. And the second reason is exactly that, scale. Twitter is understandably coy on their exact numbers, but they have hundreds of Terabytes of data (but less than a Petabyte) and one could assume that to get reasonable performance they are running Hapdoop on a few dozen nodes (this is a guess, Twitter didn’t say). As they grow analytics will become more important to their business, this may expand to hundreds (or thousands) of nodes. A “few hundred” nodes is right on the upper limit on what is possible today with the world’s most advanced MPP RBDMS’s. Hapdoop clusters, on the other hand, grow well into the hundreds and even the thousands of nodes (e.g. at Google, Facebook etc).
So Hadoop was the platform choice, but why Pig? There are other “analytical” scripting languages that sit over Hadoop, notably Hive which was popularized by Facebook (Pig was popularized by Yahoo). On discussing the merits of Pig vs Hive it became apparent that Hive was more in tune with a traditional approach (“database like”). Hive requires data to be mapped to a given structure and the queries (using a SQL like derivative) are submitted against that schema. Pig on the other hand is less prescriptive in terms of schema and individual queries can define the structure of the data for that execution. In addition, Pig is more of a “procedural” language allowing the complicated data flow process to be more easily controlled and understood by the developers.
So, as mentioned, Hapdoop is a batch based job processing platform. Jobs (in this case map/reduce jobs generated from the Pig queries) are submitted and results are returned sometime in the future. Exactly when in the future varies from a few minutes (e.g. they run jobs hourly which only take a few minutes to run) through to many hours for jobs that run over much larger sets of data. This leaves a gap in “near real-time” analytics between the lightweight queries they can run on the transactional system and the more intense Hadoop based analytics. This has been a space that Twitter has been investigating solutions to fill. This space will be used for things like improved abuse detection, issue analysis and so on. Twitter is currently considering their data platform options here including Cassandra, HBase and may even decide to use a closed sourced MPP solution to fill this need (I can’t say what, sorry) due to the lack of suitable open source MPP alternatives.
For more technical info on Twitters use of Hadoop and Pig you can check out Kevin’s slide deck from the recent NoSQL East conference.