May 08, 2012
 

There are quite a few books out now on “data science”. I’ve picked out three that I think are the best places to start for computational journalists. The first is Machine Learning for Hackers, by Drew Conway and John Myles White. The authors are frequent contributors to the #rstats hashtag, and R is the “native language” for this book. Topics of interest to computational journalists include:

  • Email processing using Bayesian classifiers to detect spam,
  • An analysis of “polarization” in roll call votes in the United States Senate, and
  • Building a Twitter follower recommendation system.

I’m in the final stages of releasing version 1.2.5 of the Computational Journalism Server, and one of my goals for the release is that the examples in this book run on it.

The second is Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites, by Matthew A. Russell, which uses Python rather than R. This might make it more accessible to computational journalists. The book includes:

  • scripts for accessing the Facebook, Twitter, LinkedIn, and Google+ APIs,
  • an excellent explanation of the basics of statistical natural language processing,
  • tools for building HTML5 / JavaScript visualizations, and
  • tools for exploring microformats.

This is more of a workstation resource than a server resource, so I’d recommend downloading Data Journalism Developer Studio 2012LX to experiment with these tools.

Finally, for those of you working in the area of politics and social media, I highly recommend Social Network Analysis for Startups: Finding connections on the social web by Maksim Tsvetovat and Alexander Kouznetsov. Like the previous book, the “native language” here is Python. Some of the topics and tools are also covered in Mining the Social Web, but there’s more depth here. You’ll also want to review the free webinars from O’Reilly on the subjects in the book.