Feb 17, 2010
 

Yesterday, Evan Weaver tweeted “Twitter open source page is live! http://twitter.com/about/opensource”. This page is a fascinating peek under Twitter’s hood – the cutting-edge open source technologies that power the popular microblogging service. For those of us who work with Twitter, this is required reading for career management and lifelong learning. And for those of us who are Twitter users, it’s a fascinating look at the future of the real-time web.

Ruby

As you may know, Twitter was originally a Ruby on Rails application. That’s actually where I first heard of Twitter – at RubyConf 2006. Early in 2007, I signed up for Twitter, and my first friends and followers were people I had met at RubyConf 2006.

As you’ll see below, Twitter has now incorporated many other technologies, but they still use Rails, and Ruby. In particular, the version of Ruby they use is Ruby Enterprise Edition (REE), a version tuned for stability and scalability.

Scala

Scala “is a general purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way. It smoothly integrates features of object-oriented and functional languages, enabling Java and other programmers to be more productive. Code sizes are typically reduced by a factor of two to three when compared to an equivalent Java application.”

It’s not just Twitter that’s using Scala; Foursquare also uses it. Because of the high visibility of Twitter and Foursquare, I expect to hear a lot more about Scala in the coming months. For those of you in the Portland, Oregon area, there is now a Scala programmers’ group, @PDXScala.

Cassandra

Cassandra is one of the newer “NoSQL” databases. It was originally developed at Facebook, and released as open source in 2008. The description on the Cassandra home page reads, “A highly scalable, eventually consistent, distributed, structured key-value store.” The key term (pun intended) here is “key-value store”. This is somewhat like what they used to call in the ancient days (1950s) “associative memory”. Rather than specify an object (the value) by its location, we give it a name (the key) and the system can find it.
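
To make the location-versus-name distinction concrete, here is a toy sketch in Python. It uses a plain dictionary as a stand-in for a key-value store. This is not Cassandra's API, and the keys and tweet texts are made up for illustration.

```python
# Toy illustration only: a Python dict standing in for a key-value store.
# This is NOT Cassandra's API; the keys and values below are invented.

# Location-based access: we have to know where the value lives.
tweets_by_position = ["first tweet", "second tweet", "third tweet"]
by_location = tweets_by_position[1]

# Key-based access: we give each value a name and let the store find it.
tweets_by_key = {
    "tweet:1001": "first tweet",
    "tweet:1002": "second tweet",
    "tweet:1003": "third tweet",
}
by_key = tweets_by_key["tweet:1002"]

# Both lookups reach the same value; only the way we address it differs.
print(by_location == by_key)
```

A real distributed store adds replication, partitioning and consistency rules on top of this idea, none of which the sketch attempts to show.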

Here’s an interview with Ryan King, Twitter’s Director of Storage, on why Twitter chose Cassandra.

Hadoop and Pig

Hadoop is another highly-scalable distributed tool. Hadoop primarily implements the MapReduce operation. MapReduce is a way of applying massive processing power to massive datasets. The concepts behind MapReduce originated in the early 1960s or even earlier, during the development of the Lisp programming language. An implementation of MapReduce has been patented by Google.
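
For readers who haven't seen MapReduce before, the classic word-count example gives the flavor. The sketch below runs in a single Python process and does not use Hadoop at all; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the input strings are my own assumptions for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as a framework would between steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine each key's values into a single total."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in grouped.items()
    }


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


# Made-up input, purely for illustration.
counts = word_count(["twitter uses hadoop", "hadoop uses mapreduce"])
print(counts)
```

The point of the pattern is that the map and reduce steps are independent per document and per key, so a framework like Hadoop can spread them across many machines.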

Pig is a “scripting language” designed to work with Hadoop. It simplifies the programming tasks of people working with large datasets.

Summary

While the technologies are interesting to technologists like me, what does such massive power give the Twitter user? And why does Twitter need it? Here’s a simple example: The exact arrival rates of tweets at Twitter aren’t widely publicized. I don’t know if this is a “trade secret” or not – I’ve seen estimates of these rates on blogs but I’m not sure that the way those estimates were obtained is technically valid.

However, a subset of the full “Firehose” data stream, called “Sample”, is publicly available via the Streaming API. There’s no official documentation on what fraction of the full Firehose comes through the Sample stream. But in a sample I collected in January, I saw a peak of 81,718 tweets in a single hour!

And what was so special about that hour? It was, to be precise, the hour between 01:00:00 and 02:00:00 UTC January 13th, 2010. The Haiti earthquake happened at 21:53:10 UTC on January 12th, 2010 – about three hours earlier. Remember – “Sample”, as the name implies, is a subset of the full tweet stream! That’s the reason Twitter needs the massive power it is getting from these cutting-edge technologies.
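
Finding a peak hour like this comes down to bucketing tweet timestamps by hour and counting. Here is a rough sketch of that idea in Python. It is not the code used to collect the sample above; the `tweets_per_hour` function and the four timestamps are hypothetical, and only the 81,718 figure comes from the data described in this post.

```python
from collections import Counter
from datetime import datetime, timezone


def tweets_per_hour(timestamps):
    """Count timezone-aware timestamps per UTC hour bucket."""
    buckets = Counter()
    for ts in timestamps:
        # Truncate each timestamp to the start of its UTC hour.
        hour = ts.astimezone(timezone.utc).replace(
            minute=0, second=0, microsecond=0
        )
        buckets[hour] += 1
    return buckets


# Made-up timestamps, purely for illustration.
sample = [
    datetime(2010, 1, 13, 0, 59, 30, tzinfo=timezone.utc),
    datetime(2010, 1, 13, 1, 0, 0, tzinfo=timezone.utc),
    datetime(2010, 1, 13, 1, 30, 15, tzinfo=timezone.utc),
    datetime(2010, 1, 13, 1, 59, 59, tzinfo=timezone.utc),
]
counts = tweets_per_hour(sample)
peak_hour, peak_count = counts.most_common(1)[0]

# The peak reported above, expressed as an average rate over that hour.
peak_rate_per_second = 81_718 / 3600
print(peak_hour, peak_count, round(peak_rate_per_second, 1))
```

Averaged over the hour, 81,718 tweets works out to roughly 22.7 tweets per second in the Sample stream alone.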

Update!

Todd Hoff of Highscalability.com has just published a more detailed analysis of Twitter’s use of Hadoop and Pig, including links to a presentation by Kevin Weil, Analytics Lead at Twitter. Both are highly recommended!

Books on the Technologies

  • Hadoop in Action by Chuck Lam (Powells.com)
  • Pro Hadoop by Jason Venner (Powells.com)
  • Programming in Scala by Martin Odersky (Powells.com)
  • Programming Scala by Dean Wampler (Powells.com)
  • Beginning Scala by David Pollak (Powells.com)
 Posted by at 21:00
Feb 15, 2010
 

Download the AlgoCompSynth Virtual Appliance


Follow @AlgoCompSynth On Twitter!


Algorithmic Composition: Paradigms of Automated Music Generation

If you want to get started in algorithmic composition, this book is the best place to start. The author, Gerhard Nierhaus, teaches algorithmic composition at the Institute of Electronic Music (IEM) at the University of Music and Dramatic Arts in Graz, Austria. His goal was to create a “detailed overview of prominent procedures of algorithmic composition in a pragmatic way,” and I think he has succeeded admirably. As you may know, algorithmic composition has a long history, featuring some names you might recognize – Kirnberger, Mozart and C.P.E. Bach. Nierhaus recaps the history of algorithmic composition and algorithms in general in chapter 2. Following this introduction, Nierhaus devotes a chapter to each of the major algorithmic composition paradigms:

  • Markov Models
  • Generative Grammars
  • Transition Networks
  • Chaos and Self-Similarity
  • Genetic Algorithms
  • Cellular Automata
  • Artificial Neural Networks
  • Artificial Intelligence

A final synopsis completes the book. My own compositions have mostly used Markov models, primarily as inspired by Xenakis’ Formalized Music. I don’t think I’m unique in staying with a single paradigm – most other algorithmic composers that I’m aware of have tended to focus on a single one of these paradigms, choosing to exploit other means to keep their music interesting. But I’ve certainly grown tired of the limitations of Markov models, and found several other techniques I can use in this book.

I would have liked to see more on sonification and on “found music”, which I regard as algorithmic composition paradigms on an equal footing with the ones Nierhaus covers. My personal favorite piece of my own was in fact “found music” – recordings taken from the sounds computers used to make as a by-product of earning their keep. And I think fuzzy logic should have had more coverage as well – Peter Elsea has done quite a bit of research in this technique that deserves to be better known.

Algorithmic composition certainly isn’t for every composer – it requires a disposition towards music theory that some composers can do without. But if you’re a musician / composer with a theoretical bent, I encourage you to try it, and this book is the best place to start.
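
If you have never seen a Markov model used for music, here is about the smallest version I can think of, sketched in Python. It is a first-order chain over note names only, and it is not taken from the book or from my own pieces. The source melody, the function names and the seed are all made up for illustration.

```python
import random
from collections import defaultdict


def build_transitions(melody):
    """Record, for each note, every note that follows it in the source."""
    transitions = defaultdict(list)
    for current, following in zip(melody, melody[1:]):
        transitions[current].append(following)
    return transitions


def generate(transitions, start, length, seed=None):
    """Walk the chain: each next note depends only on the current note."""
    rng = random.Random(seed)
    note = start
    output = [note]
    for _ in range(length - 1):
        choices = transitions.get(note)
        if not choices:
            break  # dead end: this note never had a successor in the source
        # Repeated successors in the list make common moves more likely.
        note = rng.choice(choices)
        output.append(note)
    return output


# Hypothetical source melody, purely for illustration.
source_melody = ["C", "D", "E", "C", "E", "G", "E", "D", "C"]
table = build_transitions(source_melody)
new_melody = generate(table, start="C", length=8, seed=42)
print(new_melody)
```

Every step in the generated melody is a move that occurred somewhere in the source, which is both the charm and the limitation of the technique.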


 Posted by at 15:24
Feb 4, 2010
 

Download the Data Journalism Developer Studio from SUSE Gallery!


Thanks to Drew Conway (@drewconway), a PhD student at New York University, there are now eight excellent video tutorials on using the R language up on VCASMO. I think I should do a blog post about how magical VCASMO is and why they should be eating SlideShare’s lunch, but for now, I’ll just say:

  • The presentations there are synchronized video and slides, and
  • They have an API.
  1. MATLAB/R Dictionary (Rosetta Stone Talk – 1/3)
    Presentation given by Harlan Harris at the NYC R Statistical Meetup on January 7, 2010.
  2. Learning R via Python…or the other way around (Rosetta Stone Talk – 2/3)
    Presentation given by Drew Conway at the NYC R Statistical Meetup on January 7, 2010.
  3. Data munging with SQL and R (Rosetta Stone Talk – 3/3)
    Presentation given by Josh Reich at the NYC R Statistical Meetup on January 7, 2010.
  4. Use Rapache: It Works!
    Presentation given at the Bay Area useR Group on January 10, 2010 by Jeff Horner, creator of the rApache module, on “R-driven web applications”.
  5. Web Development with R
    Presentation given at the Bay Area useR Group on January 10, 2010 by Jeroen Ooms, on how to create web applications using R.
  6. Social Network Analysis in R
    Presentation by Drew Conway on August 6, 2009 at the NYC R Statistical Programming Meetup on how to perform basic social network analysis in R using the igraph package.
  7. Introduction to the Grammar of Graphics with ggplot2 in R
    A detailed introduction to the Grammar of Graphics as implemented in R with the data visualization library ggplot2. This talk was given by Harlan Harris to the NYC R Statistical Meetup on December 3, 2009.
  8. Visualizing Data in R with ggplot2
    Drew Conway presents a brief talk on how to visualize data in R with ggplot2 at the NYC R Statistical Meetup on December 3, 2009.

In case you missed them, here is the Powell’s link for the ggplot2 book:

 Posted by at 12:09
Feb 3, 2010
 

Data Journalism Developer Studio 2012LX Blog


Disclosure

As you probably know, I live in the Portland, Oregon area and have for many years. One of the must-visit places here is Powell’s Books. The book links in this post will all take you to Powell’s as part of their Partner Program. If you’d like to join the program too, here’s the link.

Updated September 11, 2011: I recently purchased a Kindle, and two of these books are now available in that format. For those two, I’ve tweeted my Amazon Affiliate links out and have embedded those tweets here.

Prerequisite Software

To get the most out of these books, you will need to install some software. You will need Mondrian, R, GGobi, and the ggplot2, rggobi and DescribeDisplay R packages. All of these will run on a Windows, Macintosh or Linux desktop / laptop, including most netbooks. And they are all free, open source software. An easy way to get them all, packaged in an openSUSE Linux appliance, is to download Data Journalism Developer Studio 2012.

ggplot2: Elegant Graphics for Data Analysis (Use R)

ggplot2 is an advanced graphics package for the R programming language. It is based on the grammar of graphics (Grammar of Graphics, 2nd Edition). ggplot2 generates the most beautiful static graphics I’ve ever seen. You can use ggplot2 at any stage of your analysis. Quick exploratory plots can be made with a single call to the “qplot” function, and when you’re ready to create a final report or presentation, you can get publication-quality graphics.

The two things I like most about the ggplot2 package are

  • The absolutely stunning visual appeal of the plots it produces: Dr. Wickham has paid great attention to the visual aspects of the output. I don’t know of another package in any language that generates such beautiful plots.
  • The numerous built-in analysis methods: Boxplots, kernel and quantile regression and smoothing, faceted plots – all are “standard equipment” with ggplot2.
ggplot2: Elegant Graphics for Data Analysis (Use R) by Hadley Wickham http://t.co/GfXsOJm via @
@znmeb
M. Edward Borasky

Interactive Graphics for Data Analysis: Principles and Examples

This book is a complete course in interactive graphics for data analysis. It is mostly based on the Mondrian interactive statistical data visualization system, although there is some use of R as well. The first part covers the basic tools, and the second part gives case studies.

The case studies really are the best part of the book. They cover geographical analysis and some interesting history: the sinking of the Titanic and the 2004 Florida election. As I note below, there is some overlap in tools between Mondrian and GGobi, but you really need both books and both packages to be able to do everything.

Interactive Graphics for Data Analysis: Principles and Examples (Chapman & Hall/CRC C... by Martin Theus http://t.co/6XbdVWU via @
@znmeb
M. Edward Borasky

Interactive and Dynamic Graphics for Data Analysis: With R and GGobi (Use R)

As the title implies, this book is also a complete course in data analysis using interactive graphics. But the focus here is on R and GGobi rather than Mondrian. While there is some overlap in the tools, there are some things Mondrian does that GGobi doesn’t do, and vice versa. A partial list:

  • Geographic datasets: Mondrian only
  • Mosaic plots: Mondrian only
  • Classification: GGobi only
  • Clustering: GGobi only
  • Social network graphs: GGobi only

In addition, GGobi integrates directly with R and ggplot2 via the rggobi and DescribeDisplay packages. There are some integration points between R and Mondrian, but that integration isn’t as tight as it is among R, GGobi and ggplot2.

Interactive and Dynamic Graphics for Data Analysis: With R and GGobi (Use R) by Dianne Cook http://t.co/TxTU5ov via @
@znmeb
M. Edward Borasky
 Posted by at 13:25