 

In response to yesterday’s earthquake in Chile, the good folks at EPIC Colorado have adapted their Tweak-the-Tweet syntax to both English and Spanish.

“Project EPIC has created a set of prescriptive tweets using the Tweak the Tweet syntax for the Chile earthquake. You can find both the English and Spanish versions of these tweets in this Google document (http://bit.ly/9psDqd).

Please help us by re-tweeting these from our @epiccolorado and @TtT_Pacific Twitter accounts. We would like to have response and relief organizations like the Red Cross, World Bank, or other appropriate organizations retweet these prescriptive tweets so that more people will pick up the syntax and make it useful. Let me know if you have any ideas or if you can help with this effort.”

You can also follow @sophiabliu for more information.

 

Data Journalism Developer Studio 2012LX Blog


Yesterday, Evan Weaver tweeted “Twitter open source page is live! http://twitter.com/about/opensource”. This page is a fascinating peek under Twitter’s hood at the cutting-edge open source technologies that power the popular microblogging service. For those of us who work with Twitter, this is required reading for career management and lifelong learning. And for those of us who are Twitter users, it’s a look at the future of the real-time web.

Ruby

As you may know, Twitter was originally a Ruby on Rails application. That’s actually where I first heard of Twitter – at RubyConf 2006. Early in 2007, I joined Twitter, and my first friends and followers were people I had met at RubyConf 2006.

As you’ll see below, Twitter has now incorporated many other technologies, but they still use Ruby and Rails. In particular, the version of Ruby they use is Ruby Enterprise Edition (REE), a version tuned for stability and scalability.

Scala

Scala “is a general purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way. It smoothly integrates features of object-oriented and functional languages, enabling Java and other programmers to be more productive. Code sizes are typically reduced by a factor of two to three when compared to an equivalent Java application.”

It’s not just Twitter that’s using Scala; Foursquare also uses it. Because of the high visibility of Twitter and Foursquare, I expect to hear a lot more about Scala in the coming months. For those of you in the Portland, Oregon area, there is now a Scala programmers’ group, @PDXScala.

Cassandra

Cassandra is one of the newer “NoSQL” databases. It was originally developed at Facebook, and released as open source in 2008. The description on the Cassandra home page reads, “A highly scalable, eventually consistent, distributed, structured key-value store.” The key term (pun intended) here is “key-value store”. This is somewhat like what was called “associative memory” back in the ancient days (the 1950s). Rather than specify an object (the value) by its location, we give it a name (the key) and the system can find it.
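To make the key-value idea concrete, here is a minimal sketch in Python. It is only a toy that lives in one process's memory: the class name `KeyValueStore` and the key naming scheme are my own assumptions for illustration, and none of this is Cassandra's actual API. Cassandra spreads the same basic idea across many machines.

```python
# A toy in-memory key-value store. This is NOT Cassandra's API;
# the class and key names are hypothetical, chosen for illustration.
class KeyValueStore:
    def __init__(self):
        # A plain dict does the associative lookup for us.
        self._data = {}

    def put(self, key, value):
        """Store a value under a name (the key)."""
        self._data[key] = value

    def get(self, key, default=None):
        """Find a value by its name rather than by its location."""
        return self._data.get(key, default)


store = KeyValueStore()
store.put("user:znmeb:location", "Portland, Oregon")
print(store.get("user:znmeb:location"))
```

The caller never says where the value is stored. It only supplies the key, and the store does the finding.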

Here’s an interview with Ryan King, Twitter’s Director of Storage, on why Twitter chose Cassandra.

Hadoop and Pig

Hadoop is another highly scalable distributed tool. Hadoop primarily implements the MapReduce operation. MapReduce is a way of applying massive processing power to massive datasets: a “map” step processes pieces of the data in parallel, and a “reduce” step combines the partial results. The concepts behind MapReduce originated in the early 1960s or even earlier, during the development of the Lisp programming language. An implementation of MapReduce has been patented by Google.
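The MapReduce pattern can be sketched on a single machine in a few lines of Python. This is only an illustration of the idea, assuming a word-count task and two made-up input strings. The function names `map_phase`, `shuffle` and `reduce_phase` are hypothetical and are not Hadoop's API; Hadoop's contribution is running these phases across many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key into a single total."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


# Made-up input documents, for illustration only.
documents = ["earthquake in Chile", "earthquake in Haiti"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```

Because each document is mapped independently, the map step could run on as many machines as there are documents, which is where the scalability comes from.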

Pig is a “scripting language” designed to work with Hadoop. It simplifies the programming tasks of people working with large datasets.

Summary

While the technologies are interesting to technologists like me, what does such massive power give the Twitter user? And why does Twitter need it? Here’s a simple example: The exact arrival rates of tweets at Twitter aren’t widely publicized. I don’t know if this is a “trade secret” or not – I’ve seen estimates of these rates on blogs but I’m not sure that the way those estimates were obtained is technically valid.

However, there is a publicly available subset of the full “Firehose” data stream available via the Streaming API, called “Sample”. There’s no official documentation on what fraction of the full Firehose comes through the Sample stream. But in a sample I collected in January, I saw a peak of 81,718 tweets in a single hour!

And what was so special about that hour? It was, to be precise, the hour between 01:00:00 and 02:00:00 UTC January 13th, 2010. The Haiti earthquake happened at 21:53:10 UTC on January 12th, 2010 – about three hours earlier. Remember – “Sample”, as the name implies, is a subset of the full tweet stream! That’s the reason Twitter needs the massive power it is getting from these cutting-edge technologies.
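For readers who want to check the arithmetic, here is a small Python sketch. The earthquake time and the peak count come from the text above. The helper `tweets_per_hour` and the three sample timestamps are assumptions of mine, shown only to illustrate how timestamps from a collected stream could be bucketed by hour.

```python
from collections import Counter
from datetime import datetime, timezone


def tweets_per_hour(timestamps):
    """Count timestamps per UTC clock hour (hypothetical helper)."""
    counts = Counter()
    for ts in timestamps:
        counts[ts.replace(minute=0, second=0, microsecond=0)] += 1
    return counts


# Figures taken from the post.
quake = datetime(2010, 1, 12, 21, 53, 10, tzinfo=timezone.utc)
peak_hour_start = datetime(2010, 1, 13, 1, 0, 0, tzinfo=timezone.utc)
peak_count = 81718

delay = peak_hour_start - quake      # time from the quake to the peak hour
average_rate = peak_count / 3600     # average tweets per second in that hour

# Made-up timestamps, only to show the bucketing.
sample = [
    datetime(2010, 1, 13, 1, 15, tzinfo=timezone.utc),
    datetime(2010, 1, 13, 1, 45, tzinfo=timezone.utc),
    datetime(2010, 1, 13, 2, 5, tzinfo=timezone.utc),
]
hourly = tweets_per_hour(sample)

print(delay, round(average_rate, 1), hourly[peak_hour_start])
```

The peak hour works out to an average of roughly 22.7 tweets per second, and that is for the Sample subset alone.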

Update!

Todd Hoff of Highscalability.com has just published a more detailed analysis of Twitter’s use of Hadoop and Pig, including links to a presentation by Kevin Weil, Analytics Lead at Twitter. Both are highly recommended!

Books on the Technologies

  • Hadoop in Action, by Chuck Lam (Powells.com)
  • Pro Hadoop, by Jason Venner (Powells.com)
  • Programming in Scala, by Martin Odersky (Powells.com)
  • Programming Scala, by Dean Wampler (Powells.com)
  • Beginning Scala, by David Pollak (Powells.com)
 

Download the AlgoCompSynth Virtual Appliance


Follow @AlgoCompSynth On Twitter!


Algorithmic Composition: Paradigms of Automated Music Generation

If you want to get started in algorithmic composition, this book is the best place to start. The author, Gerhard Nierhaus, teaches algorithmic composition at the Institute of Electronic Music (IEM) at the University of Music and Dramatic Arts in Graz, Austria. His goal was to create a “detailed overview of prominent procedures of algorithmic composition in a pragmatic way,” and I think he has succeeded admirably. As you may know, algorithmic composition has a long history, featuring some names you might recognize – Kirnberger, Mozart and C.P.E. Bach. Nierhaus recaps the history of algorithmic composition and algorithms in general in chapter 2. Following this introduction, Nierhaus devotes a chapter to each of the major algorithmic composition paradigms:

  • Markov Models
  • Generative Grammars
  • Transition Networks
  • Chaos and Self-Similarity
  • Genetic Algorithms
  • Cellular Automata
  • Artificial Neural Networks
  • Artificial Intelligence

A final synopsis completes the book.

My own compositions have mostly used Markov models, primarily as inspired by Xenakis’ Formalized Music. I don’t think I’m unique in staying with a single paradigm – most other algorithmic composers that I’m aware of have tended to focus on a single one of these paradigms, choosing to exploit other means to keep their music interesting. But I’ve certainly grown tired of the limitations of Markov models, and found several other techniques I can use in this book.

I would have liked to see more on sonification and on “found music”, which I regard as algorithmic composition paradigms on an equal footing with the ones Nierhaus covers. My personal favorite piece of my own was in fact “found music” – recordings taken from the sounds computers used to make as a by-product of earning their keep. And I think fuzzy logic should have had more coverage as well – Peter Elsea has done quite a bit of research in this technique that deserves to be better known.

Algorithmic composition certainly isn’t for every composer – it requires a disposition towards music theory that some composers can do without. But if you’re a musician / composer with a theoretical bent, I encourage you to try it, and this book is the best place to start.
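For readers who haven’t met Markov models in a musical setting, here is a minimal Python sketch of a first-order Markov melody generator. It is an illustration under my own assumptions, not an example from the book and not the method used in my pieces: the source melody is made up, and the function names `build_transitions` and `generate` are hypothetical.

```python
import random
from collections import defaultdict


def build_transitions(melody):
    """Record, for each note, every note that followed it in the source."""
    transitions = defaultdict(list)
    for current, following in zip(melody, melody[1:]):
        transitions[current].append(following)
    return transitions


def generate(transitions, start, length, seed=None):
    """Walk the chain: each new note depends only on the previous one."""
    rng = random.Random(seed)
    note = start
    output = [note]
    while len(output) < length:
        choices = transitions.get(note)
        if not choices:
            break  # this note was never followed by anything in the source
        note = rng.choice(choices)
        output.append(note)
    return output


# A made-up source melody, for illustration only.
source_melody = ["C", "D", "E", "C", "E", "G", "E", "D", "C"]
table = build_transitions(source_melody)
new_melody = generate(table, start="C", length=8, seed=42)
print(" ".join(new_melody))
```

Notes that follow each other often in the source appear more than once in the table, so they are chosen more often. The limitation is also visible here: the generator has no memory beyond the single previous note.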

AlgoCompSynth Reading List

 

 

As you probably know, I’m a big fan of ViralHeat. I’ve been using ViralHeat for social media monitoring since their first public release, about seven months ago. So what’s changed?

  • ViralHeat now monitors Facebook. I’m not a big user of Facebook as a promotion tool, but it’s a place where people do talk about you, so it’s a place you need to be listening.
  • The dashboard is much improved. You now get an overview of mentions in all your profiles, and each profile has its own view with a Profile Dashboard tab, a Microblogging tab, a Facebook tab, a Realtime Web tab and a Video tab.
  • The pricing structure has changed. The Basic account is still $9.95 a month, but the next level up, Professional, has been reduced to $29.99 a month. For this account, you get up to 20 profiles, full viral content analytics, sentiment analysis and 500 API calls per day.

As a developer, I find API access important. The API allows automating anything the web application can do, such as adding and deleting profiles. But it also allows retrieval of statistics and trends, and even the raw data! Having this capability is a big time saver for me. Twitter data is easy to get – I routinely collect Twitter data using the Twitter API. But being able to get at the raw data from other web properties – Facebook, Realtime Web and Video – is a huge improvement.

So if you haven’t explored ViralHeat yet, I highly recommend trying it out. One final note – ViralHeat is free for non-profit organizations. For more information, head over to

http://bit.ly/b3AKJm

 

This tutorial covers profiling of Linux servers using open-source tools such as “iostat”, “oprofile” and “blktrace”. Both processor-bound and I/O-bound cases are covered, and the emphasis is on tools that provide visual displays of relevant metrics.

Linux Server Profiling: Using Open Source Tools For Bottleneck Analysis

 

If you’re a web analytics aficionado, you know that most analytics tools, including Clicky, give you statistics on which browsers your visitors are using. Through the magic of Clicky’s real-time analysis, you can see this for my web site for the past 90 days. I’m opening this blog post up for comments – the question is, “What the Heck is Happening to Internet Explorer?”

Clicky Web Analytics

 

Download the Data Journalism Developer Studio from SUSE Gallery!


Thanks to Drew Conway (@drewconway), a PhD student at New York University, there are now eight excellent video tutorials on using the R language up on VCASMO. I think I should do a blog post about how magical VCASMO is and why they should be eating SlideShare’s lunch, but for now, I’ll just say

  • The presentations there are synchronized video and slides, and
  • They have an API.

  1. MATLAB/R Dictionary (Rosetta Stone Talk – 1/3)
    Presentation given by Harlan Harris at the NYC R Statistical Meetup on January 7, 2010.
  2. Learning R via Python…or the other way around (Rosetta Stone Talk – 2/3)
    Presentation given by Drew Conway at the NYC R Statistical Meetup on January 7, 2010.
  3. Data munging with SQL and R (Rosetta Stone Talk – 3/3)
    Presentation given by Josh Reich at the NYC R Statistical Meetup on January 7, 2010.
  4. Use Rapache: It Works!
    Presentation given at the Bay Area useR Group on January 10, 2010 by Jeff Horner, creator of the Rapache module, on “R-driven web applications”.
  5. Web Development with R
    Presentation given at the Bay Area useR Group on January 10, 2010 by Jeroen Ooms, on how to create web applications using R.
  6. Social Network Analysis in R
    Presentation by Drew Conway on August 6, 2009 at the NYC R Statistical Programming Meetup on how to perform basic social network analysis in R using the igraph package.
  7. Introduction to the Grammar of Graphics with ggplot2 in R
    A detailed introduction to the Grammar of Graphics as implemented in R with the data visualization library ggplot2. This talk was given by Harlan Harris to the NYC R Statistical Meetup on December 3, 2009.
  8. Visualizing Data in R with ggplot2
    Drew Conway presents a brief talk on how to visualize data in R with ggplot2 at the NYC R Statistical Meetup on December 3, 2009.

In case you missed them, here is the Powell’s link for the ggplot2 book:

 



Disclosure

As you probably know, I live in the Portland, Oregon area and have for many years. One of the must-visit places here is Powell’s Books. The book links in this post will all take you to Powell’s as part of their Partner Program. If you’d like to join the program too, here’s the link.

Updated September 11, 2011: I recently purchased a Kindle, and two of these books are now available in that format. For those two, I’ve tweeted my Amazon Affiliate links out and have embedded those tweets here.

Prerequisite Software

To get the most out of these books, you will need to install some software. You will need Mondrian, R, GGobi, and the ggplot2, rggobi and DescribeDisplay R packages. All of these will run on a Windows, Macintosh or Linux desktop / laptop, including most netbooks. And they are all free, open source software. An easy way to get them all, packaged in an openSUSE Linux appliance, is to download Data Journalism Developer Studio 2012.

ggplot2: Elegant Graphics for Data Analysis (Use R)

ggplot2 is an advanced graphics package for the R programming language. It is based on the grammar of graphics described in Leland Wilkinson’s The Grammar of Graphics (2nd edition). ggplot2 generates the most beautiful static graphics I’ve ever seen. You can use ggplot2 at any stage of your analysis. Exploratory plots can be made with a single call to the “qplot” function, and when you’re ready to create a final report or presentation, you can get publication-quality graphics.

The two things I like most about the ggplot2 package are

  • The absolutely stunning visual appeal of the plots it produces: Dr. Wickham has paid great attention to the visual aspects of the output. I don’t know of another package in any language that generates such beautiful plots.
  • The numerous built-in analysis methods: Boxplots, kernel and quantile regression and smoothing, faceted plots – all are “standard equipment” with ggplot2.
ggplot2: Elegant Graphics for Data Analysis (Use R) by Hadley Wickham http://t.co/GfXsOJm (tweeted by M. Edward Borasky, @znmeb)

Interactive Graphics for Data Analysis: Principles and Examples

This book is a complete course in interactive graphics for data analysis. It is mostly based on the Mondrian interactive statistical data visualization system, although there is some use of R as well. The first part covers the basic tools, and the second part gives case studies.

The case studies really are the best part of the book. They cover geographical analysis and some interesting history, including the sinking of the Titanic and the 2004 Florida election. As I note below, there is some overlap in tools between Mondrian and GGobi, but you really need both books and both packages to be able to do everything.

Interactive Graphics for Data Analysis: Principles and Examples (Chapman & Hall/CRC C... by Martin Theus http://t.co/6XbdVWU (tweeted by M. Edward Borasky, @znmeb)

Interactive and Dynamic Graphics for Data Analysis: With R and GGobi (Use R)

As the title implies, this book is also a complete course in data analysis using interactive graphics. But the focus here is on R and GGobi rather than Mondrian. While there is some overlap in the tools, there are some things Mondrian does that GGobi doesn’t do, and vice versa. A partial list:

  • Geographic datasets: Mondrian only
  • Mosaic plots: Mondrian only
  • Classification: GGobi only
  • Clustering: GGobi only
  • Social network graphs: GGobi only

In addition, GGobi integrates directly with R and ggplot2 via the rggobi and DescribeDisplay packages. There are some integration points between R and Mondrian, but that integration isn’t as tight as it is with R, GGobi and ggplot2.

Interactive and Dynamic Graphics for Data Analysis: With R and GGobi (Use R) by Dianne Cook http://t.co/TxTU5ov (tweeted by M. Edward Borasky, @znmeb)
© 2011 Borasky Research Journal