A Peek Under Twitter’s Hood (Updated!)


Yesterday, Evan Weaver tweeted “Twitter open source page is live! http://twitter.com/about/opensource”. This page is a fascinating peek under Twitter’s hood – the cutting-edge open source technologies that power the popular microblogging service. For those of us who work with Twitter, it is required reading for career management and lifelong learning. And for those of us who are Twitter users, it offers a look at the future of the real-time web.

Ruby

As you may know, Twitter was originally a Ruby on Rails application. That’s actually where I first heard of Twitter – at RubyConf 2006. Early in 2007, I signed up for Twitter, and my first friends and followers were people I had met at RubyConf 2006.

As you’ll see below, Twitter has now incorporated many other technologies, but they still use Ruby and Rails. Specifically, they run Ruby Enterprise Edition (REE), a version of Ruby tuned for stability and scalability.

Scala

Scala “is a general purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way. It smoothly integrates features of object-oriented and functional languages, enabling Java and other programmers to be more productive. Code sizes are typically reduced by a factor of two to three when compared to an equivalent Java application.”

It’s not just Twitter that’s using Scala; Foursquare also uses it. Because of the high visibility of Twitter and Foursquare, I expect to hear a lot more about Scala in the coming months. For those of you in the Portland, Oregon area, there is now a Scala programmers’ group, @PDXScala.

Cassandra

Cassandra is one of the newer “NoSQL” databases. It was originally developed at Facebook and released as open source in 2008. The description on the Cassandra home page reads, “A highly scalable, eventually consistent, distributed, structured key-value store.” The key term (pun intended) here is “key-value store”. This is somewhat like what was called “associative memory” in the ancient days (1950s). Rather than specifying an object (the value) by its location, we give it a name (the key), and the system finds it by that name.
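To make the key-value idea concrete, here is a minimal sketch in Python. It is a toy in-memory store, not Cassandra's API: the class name `KeyValueStore`, its `put` and `get` methods, and the example keys are all hypothetical, and nothing here is distributed or persistent.

```python
class KeyValueStore:
    """Toy in-memory key-value store (illustration only, not Cassandra)."""

    def __init__(self):
        # A Python dict is a hash table: it maps a key to its value.
        self._data = {}

    def put(self, key, value):
        """Store a value under a name; the caller never sees a location."""
        self._data[key] = value

    def get(self, key, default=None):
        """Look a value up by name, returning `default` if the key is absent."""
        return self._data.get(key, default)


store = KeyValueStore()
store.put("user:alice:name", "Alice")  # made-up key and value
found = store.get("user:alice:name")
missing = store.get("user:bob:name", "unknown")
print(found, missing)
```

The caller only ever handles names. Where the value physically lives is the store's business, which is what lets a real system spread the data across many machines.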

Here’s an interview with Ryan King, Twitter’s Director of Storage, on why Twitter chose Cassandra.

Hadoop and Pig

Hadoop is another highly scalable distributed tool. Hadoop primarily implements the MapReduce operation. MapReduce applies massive processing power to massive datasets by splitting the work into a “map” step that runs independently on each piece of the data and a “reduce” step that combines the results. The concepts behind MapReduce originated in the early 1960s or even earlier, during the development of the Lisp programming language. An implementation of MapReduce has been patented by Google.
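The map-then-reduce pattern can be sketched in a few lines of single-process Python. This is only an illustration of the idea, assuming a word-count task and made-up input. It does not use Hadoop, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict


def map_phase(tweet):
    """Map step: emit a (word, 1) pair for every word in one tweet."""
    return [(word.lower(), 1) for word in tweet.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(tweets):
    # Each tweet is mapped independently, so a cluster could do this in parallel.
    pairs = [pair for tweet in tweets for pair in map_phase(tweet)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


tweets = ["hadoop and pig", "pig and scala"]  # made-up input
counts = word_count(tweets)
print(counts)
```

Because the map step looks at one record at a time and the reduce step looks at one key at a time, both can be spread over many machines. A framework such as Hadoop handles that distribution.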

Pig is a platform designed to work with Hadoop; its “scripting language”, Pig Latin, simplifies the programming tasks of people working with large datasets.

Summary

While the technologies are interesting to technologists like me, what does such massive power give the Twitter user? And why does Twitter need it? Here’s a simple example: The exact arrival rates of tweets at Twitter aren’t widely publicized. I don’t know if this is a “trade secret” or not – I’ve seen estimates of these rates on blogs, but I’m not sure the methods behind those estimates are technically valid.

However, a subset of the full “Firehose” data stream, called “Sample”, is publicly available via the Streaming API. There’s no official documentation on what fraction of the full Firehose comes through the Sample stream. But in a sample I collected in January, I saw a peak of 81,718 tweets in a single hour!

And what was so special about that hour? It was, to be precise, the hour between 01:00:00 and 02:00:00 UTC January 13th, 2010. The Haiti earthquake happened at 21:53:10 UTC on January 12th, 2010 – about three hours earlier. Remember – “Sample”, as the name implies, is a subset of the full tweet stream! That’s the reason Twitter needs the massive power it is getting from these cutting-edge technologies.
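For readers who want to reproduce this kind of hourly count, here is a minimal Python sketch of the bucketing arithmetic. It is not the code I used on the Sample stream: the function `tweets_per_hour` is hypothetical and the four tweet timestamps are made up. Only the earthquake time comes from the text above.

```python
from collections import Counter
from datetime import datetime, timezone


def tweets_per_hour(timestamps):
    """Count timestamps per UTC hour; each key is the start of an hour."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )


# Made-up tweet timestamps, for illustration only.
sample = [
    datetime(2010, 1, 13, 0, 59, 58, tzinfo=timezone.utc),
    datetime(2010, 1, 13, 1, 0, 0, tzinfo=timezone.utc),
    datetime(2010, 1, 13, 1, 30, 15, tzinfo=timezone.utc),
    datetime(2010, 1, 13, 2, 0, 0, tzinfo=timezone.utc),
]

counts = tweets_per_hour(sample)
peak_hour, peak_count = counts.most_common(1)[0]

# Time from the earthquake (21:53:10 UTC, January 12th) to the start of the peak hour.
quake = datetime(2010, 1, 12, 21, 53, 10, tzinfo=timezone.utc)
gap = peak_hour - quake
print(peak_hour, peak_count, gap)
```

With these made-up timestamps the peak bucket starts at 01:00:00 UTC on January 13th, and the gap from the earthquake works out to 3 hours, 6 minutes and 50 seconds, which is the “about three hours” mentioned above.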

Update!

Todd Hoff of Highscalability.com has just published a more detailed analysis of Twitter’s use of Hadoop and Pig, including links to a presentation by Kevin Weil, Analytics Lead at Twitter. Both are highly recommended!

Books on the Technologies

Hadoop in Action, by Chuck Lam
Pro Hadoop, by Jason Venner
Programming in Scala, by Martin Odersky
Programming Scala, by Dean Wampler
Beginning Scala, by David Pollak

© 2011 Borasky Research Journal