Data Journalism Developer Studio 2012LX Blog
Yesterday, Evan Weaver tweeted “Twitter open source page is live! http://twitter.com/about/opensource”. This page is a fascinating peek under Twitter’s hood – the cutting edge open source technologies that power the popular microblogging service. For those of us who work with Twitter, this is required reading for career management and lifelong learning. And for those of us who are Twitter users, it’s a fascinating look at the future of the real-time web.
Ruby
As you may know, Twitter was originally a Ruby on Rails application. In fact, I first heard of Twitter at RubyConf 2006. Early in 2007 I signed up, and my first friends and followers were people I had met at that conference.
As you’ll see below, Twitter has since incorporated many other technologies, but it still uses Ruby and Rails. In particular, the Ruby it runs is Ruby Enterprise Edition (REE), a version tuned for stability and scalability.
Scala
Scala “is a general purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way. It smoothly integrates features of object-oriented and functional languages, enabling Java and other programmers to be more productive. Code sizes are typically reduced by a factor of two to three when compared to an equivalent Java application.”
It’s not just Twitter that’s using Scala; Foursquare also uses it. Because of the high visibility of Twitter and Foursquare, I expect to hear a lot more about Scala in the coming months. For those of you in the Portland, Oregon area, there is now a Scala programmers’ group, @PDXScala.
Cassandra
Cassandra is one of the newer “NoSQL” databases. It was originally developed at Facebook and released as open source in 2008. The description on the Cassandra home page reads, “A highly scalable, eventually consistent, distributed, structured key-value store.” The key term (pun intended) here is “key-value store”. This is somewhat like what was called “associative memory” in the ancient days (1950s). Rather than specifying an object (the value) by its location, we give it a name (the key) and let the system find it.
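To make the difference between “by location” and “by key” concrete, here is a minimal Python sketch. It is not Cassandra’s API: a plain in-memory dict stands in for the store, and the key names (“tweet:1001” and so on) are made up for illustration.

```python
# Location-based access: we have to know where the value lives.
tweets_by_position = ["first tweet", "second tweet", "third tweet"]
second = tweets_by_position[1]

# Key-based access: we name the value and let the store find it.
# A dict is only a stand-in here; a real distributed store would also
# decide which machine holds each key. The keys below are hypothetical.
tweets_by_key = {
    "tweet:1001": "first tweet",
    "tweet:1002": "second tweet",
    "tweet:1003": "third tweet",
}
same_second = tweets_by_key["tweet:1002"]
```

The point of the second form is that the caller never needs to know where the value is kept, which is what lets a system like Cassandra spread the data across many machines.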
Here’s an interview with Ryan King, Twitter’s Director of Storage, on why Twitter chose Cassandra.
Hadoop and Pig
Hadoop is another highly scalable distributed tool. At its core, Hadoop is an implementation of MapReduce, a way of applying massive processing power to massive datasets by splitting the work into a “map” step and a “reduce” step that can run on many machines at once. The concepts behind MapReduce, the map and reduce operations of functional programming, originated in the early 1960s or even earlier, during the development of the Lisp programming language. An implementation of MapReduce has been patented by Google.
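Here is a rough single-machine sketch of the MapReduce idea in Python, using the classic word-count example. It does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase) and the sample documents are my own, chosen only to show the shape of the computation.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, the job a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all the values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    # Each document could be mapped on a different machine; here we loop.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Made-up input, for illustration only.
counts = word_count(["haiti earthquake news", "haiti relief", "earthquake relief news"])
```

Because each call to map_phase looks at only one document and each call to reduce_phase looks at only one key, both steps can be farmed out to as many machines as you have. That independence is where the scalability comes from.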
Pig is a “scripting language” designed to work with Hadoop. It simplifies the programming tasks of people working with large datasets.
Summary
While the technologies are interesting to technologists like me, what does such massive power give the Twitter user? And why does Twitter need it? Here’s a simple example: The exact arrival rates of tweets at Twitter aren’t widely publicized. I don’t know if this is a “trade secret” or not – I’ve seen estimates of these rates on blogs but I’m not sure that the way those estimates were obtained is technically valid.
However, a subset of the full “Firehose” data stream, called “Sample”, is publicly available via the Streaming API. There’s no official documentation on what fraction of the full Firehose comes through the Sample stream. But in a sample I collected in January, I saw a peak of 81,718 tweets in a single hour!
And what was so special about that hour? It was, to be precise, the hour between 01:00:00 and 02:00:00 UTC January 13th, 2010. The Haiti earthquake happened at 21:53:10 UTC on January 12th, 2010 – about three hours earlier. Remember – “Sample”, as the name implies, is a subset of the full tweet stream! That’s the reason Twitter needs the massive power it is getting from these cutting-edge technologies.
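For readers who want to do this kind of counting themselves, here is a small Python sketch of how hourly totals like the one above might be computed from tweet timestamps. It is not the code I used; the function names and the three sample timestamps are hypothetical, and it assumes the timestamps have already been parsed into timezone-aware datetime objects.

```python
from collections import Counter
from datetime import datetime, timezone


def tweets_per_hour(timestamps):
    """Count tweets in each UTC hour, given timezone-aware datetimes."""
    return Counter(
        ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
        for ts in timestamps
    )


def per_second(count_in_hour):
    """Convert an hourly count into an average rate per second."""
    return count_in_hour / 3600


# Three made-up timestamps, for illustration only.
sample = [
    datetime(2010, 1, 13, 0, 59, 30, tzinfo=timezone.utc),
    datetime(2010, 1, 13, 1, 0, 5, tzinfo=timezone.utc),
    datetime(2010, 1, 13, 1, 42, 17, tzinfo=timezone.utc),
]
hourly = tweets_per_hour(sample)

# The peak hour reported above, expressed as an average rate.
peak_rate = per_second(81_718)
```

At 81,718 tweets in the hour, the Sample stream alone averaged about 22.7 tweets per second, and the full Firehose rate would be some unknown multiple of that.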
Update!
Todd Hoff of Highscalability.com has just published a more detailed analysis of Twitter’s use of Hadoop and Pig, including links to a presentation by Kevin Weil, Analytics Lead at Twitter. Both are highly recommended!






A Peek Under Twitter’s Hood (Updated!) | Borasky Research Journal http://borasky-research.net/2010/02/17/a...