May 14, 2012
 

If you’ve been following my tweet stream, you saw me tweet this:

At $1450 a month for five seats, I think the service is overpriced. Moreover, Twitter, Facebook/Instagram, Google/YouTube or Yahoo/Flickr could easily build this into their web sites and deliver it for free, essentially bypassing two middlemen – Geofeedia and the news organization subscribing to Geofeedia. And a clever RSS / Yahoo! Pipes hacker could build something like this for use in a newsroom. For that matter, if you limit yourself to Twitter, you can do most of this with Twitter / Advanced Search.

I must admit that I love the idea and think this could evolve into something game-changing. I wrote about the potential for this back in January 2010!

The Twitter Streaming API — How It Works and Why It’s A Big Deal

To get an idea what this could become, check out Knowledge Discovery from Data Streams by Joao Gama.

Moving on, I don’t know how I’ve managed to be a tech blogger writing about computational journalism without discovering Overview until last week, but it happened. Twitter serendipity at work – I was watching my Interactions page and saw a tweet of mine retweeted by @overviewproject. The Overview project is led by Jonathan Stray. You can see the entire team here.

Overview is open source, lives on GitHub and appears to be a mix of Ruby and Java. I’m currently testing it out for potential inclusion in one of my computational journalism appliances. It’s a browser / desktop application, so most likely it will end up in the successor to Data Journalism Developer Studio 2012LX. If you want to work with it yourself, the instructions are here.

So which of the two represents the future of journalism? Both, of course! With the proper underlying database and real-time knowledge discovery algorithms, Geofeedia could be a game-changer. But in the long run, as a for-profit service, I think they’ll either get acquired or duplicated by the big players. The Overview project, on the other hand, is an open source project. It’s well-funded by the Knight Foundation and Associated Press, and the team is led by one of the well-known names in computational journalism. Overview is certainly going to be part of my future.

 

May 9, 2012
 

Updated 2012-05-14 13:28 PDT – Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement has gone to print. Congratulations to Eric Redmond and Jim R. Wilson!


As I’ve noted before, I believe any discipline is defined by its practitioners and the tools they use, and computational journalism is no exception. I think computational journalism is a superset of data journalism, but certainly the terms “data journalism” and “data science” cover the bulk of the tools I have collected and used.

In any case, to do computational journalism, at least some data must be collected, stored, explored, analyzed, cleaned, managed and “governed.” In the past few years, the “traditional” tools for doing this, called relational database management systems (RDBMS), have been supplemented by a new class of tools broadly known as “NoSQL” databases. The name NoSQL comes from the most widely used language for dealing with a traditional RDBMS, SQL.

The NoSQL field is rapidly evolving, but enough knowledge exists to fill several books. The best overview of databases for computational journalists I’ve found comes from a soon-to-be-released work from Eric Redmond and Jim R. Wilson called Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement.

The book has been available in beta for a few months from the publisher, Pragmatic Programmers, LLC, and I’ve been working through it in the course of collecting the tools for Data Journalism Developer Studio 2012LX and Computational Journalism Server. My goal is to have all of the databases available in both appliances, although at the moment only PostgreSQL, MongoDB, CouchDB and Redis are available directly from SUSE Studio.

Seven Databases in Seven Weeks covers, in order:

  • PostgreSQL, a traditional RDBMS,
  • Riak, a key-value database,
  • HBase, a columnar database,
  • MongoDB, a document-oriented database,
  • CouchDB, a document-oriented database,
  • Neo4j, a graph-oriented database, and
  • Redis, a key-value database / data structure server.

All of these databases are open source, and they’re all supported by either a corporate entity, a non-profit foundation, or some combination of the two. The title really should have been “Seven Databases in Seven Weekends”; each database is covered in a three-day hands-on session and could easily be done as a weekend project. The book is hands-on – you’ll build things with these databases, including a Node.js application combining Redis, CouchDB and Neo4j into a “band information service.”

Appendix A contains a pair of tables that give an overview of the distinguishing characteristics of the seven databases. As the authors put it, “Although the tables are not a replacement for a true understanding, they should provide you with an at-a-glance sense of what each database is capable of, where it falls short, and how it fits into the modern database landscape.”

I believe all of these databases have a place in modern computational journalism, as do the other two well-known open source RDBMS tools, MySQL and SQLite. In particular, for spatial / mapping projects, PostgreSQL, SQLite, MongoDB and CouchDB have robust geographic information systems capabilities either built in or available as add-ons.

Riak, HBase, MongoDB and CouchDB all support “big data” applications implemented via MapReduce. MongoDB and CouchDB both store their documents as JavaScript Object Notation (JSON) objects, which is the “native” format for Twitter data. Neo4j, as a graph database, is perfect for storing data about relationships, such as the interconnections between corporate executives and legislators. And because of its speed, Redis can serve as a high-speed pipeline between other components in almost any application architecture.
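To make the MapReduce idea concrete, here is a minimal sketch in plain Python that counts hashtags across a few tweet-like JSON documents. It does not use any of the databases above, and it is not taken from the book; the sample documents, the field names (`user`, `text`) and the function names are all made up for illustration. Each database has its own MapReduce interface, so treat this only as a picture of the map, group and reduce steps.

```python
import json
from collections import defaultdict

# Hypothetical tweet-like JSON documents; field names are illustrative only.
RAW_DOCS = [
    '{"user": "alice", "text": "Budget vote tonight #cityhall #budget"}',
    '{"user": "bob", "text": "Live from #cityhall"}',
    '{"user": "carol", "text": "No hashtags here"}',
]


def map_hashtags(doc):
    """Map step: emit a (hashtag, 1) pair for every hashtag in one document."""
    for word in doc["text"].split():
        if word.startswith("#") and len(word) > 1:
            yield word.lower(), 1


def reduce_counts(key, values):
    """Reduce step: combine all values emitted for one key into a total."""
    return sum(values)


def map_reduce(docs, mapper, reducer):
    """Run the mapper over every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


docs = [json.loads(raw) for raw in RAW_DOCS]
hashtag_counts = map_reduce(docs, map_hashtags, reduce_counts)
print(hashtag_counts)
```

In a real deployment the map and reduce functions would run inside the database across many machines; the grouping step in the middle is the part the database handles for you.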

I think NoSQL databases will be the core of computational journalism for the next few years. The RDBMS isn’t going away, of course, but if you limit yourself to “SQL thinking” or even “object-relational models” and “model-view-controller” architectures, there will be applications you can’t build. This book will get you up to speed as fast as you’re willing to go.

May 8, 2012
 

There are quite a few books out now on “data science”. I’ve picked out three that I think are the best places to start for computational journalists. First is Machine Learning for Hackers, by Drew Conway and John Myles White. The authors are frequent contributors to the #rstats hashtag; R is the “native language” for this book. Topics of interest to computational journalists include:

  • Email processing using Bayesian classifiers to detect spam,
  • An analysis of “polarization” in roll call votes in the United States Senate, and
  • Building a Twitter follower recommendation system.
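For a feel of what a Bayesian spam classifier does, here is a minimal naive Bayes sketch. It is written in Python rather than the book’s R, it is not the authors’ code, and the four training messages are invented for illustration. It assumes whitespace tokenization and add-one (Laplace) smoothing, which are common simplifications.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


def train(labeled_messages):
    """Count words per label and messages per label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    message_counts = Counter()
    for text, label in labeled_messages:
        message_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, message_counts


def classify(text, word_counts, message_counts):
    """Return the label with the higher log-probability under naive Bayes."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_messages = sum(message_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Start from the log prior: how common this label is in training.
        score = math.log(message_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            numerator = word_counts[label][word] + 1
            denominator = total_words + len(vocabulary)
            score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data, made up for this sketch.
TRAINING = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("notes from the budget meeting", "ham"),
]

word_counts, message_counts = train(TRAINING)
print(classify("claim your free money", word_counts, message_counts))
print(classify("agenda for the budget meeting", word_counts, message_counts))
```

A real classifier needs far more training data and better tokenization, but the structure (count words per class, then compare smoothed log-probabilities) stays the same.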

I’m in the final stages of releasing version 1.2.5 of the Computational Journalism Server, and one of the goals is for the examples in this book to run.

Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites, by Matthew A. Russell, uses the Python language rather than R. This might make it more accessible to computational journalists. The book includes

  • scripts for accessing Facebook, Twitter, LinkedIn and Google+ APIs,
  • an excellent explanation of the basics of statistical natural language processing,
  • tools for building HTML5 / JavaScript visualizations, and
  • tools for exploring microformats.

This is more of a workstation resource than a server resource, so I’d recommend downloading Data Journalism Developer Studio 2012LX to experiment with these tools.

Finally, for those of you working in the area of politics and social media, I highly recommend Social Network Analysis for Startups: Finding connections on the social web by Maksim Tsvetovat and Alexander Kouznetsov. Like the previous book, the “native language” here is Python. Some of the topics and tools are also covered in Mining the Social Web, but there’s more depth here. You’ll also want to review the free webinars from O’Reilly on the subjects in the book.

 

May 7, 2012
 

As noted yesterday, this blog is mostly about my hobbies. Yesterday was also the second anniversary of the 2010 Flash Crash, and since one of my hobbies is computational finance, I’m listing what I think are the five best trading books of all time – as of yesterday, of course. If you’re interested in the topic of algorithmic trading and market microstructure, there’s quite a flurry of activity because of the Flash Crash anniversary, and I’m curating this Scoop.it topic on the subject.

The oldest book on my list, first published in 1922, is Reminiscences of a Stock Operator, by Edwin Lefèvre. If you read this carefully, you’ll find that many of “Larry Livingston’s” methods differ only in the speed and mechanics of execution from those of today’s algorithmic traders. And this edition, published in late 2009, is updated with copious notes on the financial crisis of 2007 – 2008 by the editor and by Paul Tudor Jones.

The newest book on my list, from 2011, is Financial Markets and Trading: An Introduction to Market Microstructure and Trading Strategies, by Anatoly B. Schmidt. I’ve covered this book at length here and here, so I won’t say any more about it.

The remaining three books on my list are classics. First, Perry Kaufman’s New Trading Systems and Methods, Fourth Edition. This is the most comprehensive description of trading systems design and testing available. There are other books on indicators, charting, money management, fundamentals and testing, but I’ve not found another that covers such a broad range as well as this one. Many of Kaufman’s own contributions are documented here – dealing with price shocks, adaptive moving averages and his work on point and figure charting with Kermit Zieg.
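As an illustration of the adaptive moving average idea, here is a minimal sketch of the indicator as it is commonly described: an “efficiency ratio” compares net price change to total price movement over a window, and that ratio sets how quickly the average follows price. This is my own sketch, not code from the book; the function name, the seeding choice and the default parameters (a 10-bar window with fast and slow periods of 2 and 30) are assumptions, so check Kaufman’s text for the authoritative definition.

```python
def kaufman_adaptive_moving_average(prices, period=10, fast=2, slow=30):
    """Return adaptive moving average values for prices[period:].

    Assumption: the average is seeded with prices[period - 1].
    """
    if len(prices) <= period:
        raise ValueError("need more than `period` prices")

    fast_sc = 2.0 / (fast + 1)   # smoothing constant for a fast EMA
    slow_sc = 2.0 / (slow + 1)   # smoothing constant for a slow EMA

    previous = prices[period - 1]
    result = []
    for t in range(period, len(prices)):
        # Net change over the window versus the sum of bar-to-bar moves.
        change = abs(prices[t] - prices[t - period])
        volatility = sum(
            abs(prices[i] - prices[i - 1]) for i in range(t - period + 1, t + 1)
        )
        efficiency = change / volatility if volatility else 0.0

        # Trending markets (efficiency near 1) use the fast constant;
        # noisy markets (efficiency near 0) use the slow one.
        smoothing = (efficiency * (fast_sc - slow_sc) + slow_sc) ** 2
        previous = previous + smoothing * (prices[t] - previous)
        result.append(previous)
    return result


# Hypothetical price series, made up for this sketch.
trend = [float(x) for x in range(10)]
print(kaufman_adaptive_moving_average(trend, period=3))
```

On a steadily trending series the average tracks price closely, while on a choppy series with the same start and end points it barely moves. That is the behavior the “adaptive” label refers to.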

Along similar lines, but more focused on technical analysis and futures trading, is Technical Traders Guide to Analysis of the Futures Markets, by Charles Lebeau and David Lucas. Lebeau and Lucas dissect a number of popular trading strategies and devise a systematic methodology for testing and managing them. They also cover point and figure charting, though not in as much detail as Kaufman.

I had trouble deciding which of the following two books to pick for number five, so I’m going to present them both. First is The Encyclopedia of Trading Strategies by Jeffrey Katz and Donna McCormick. Katz and McCormick performed a massive search of strategies on real data and documented which worked (very few) and which didn’t (most of them). The code they used is available from them, although given how few strategies worked, I’m not sure it’s worth repeating the studies on more recent data. But the message is clear, from this book and the others on this list – not a lot works, nothing always works, and it’s easy to make mistakes that invalidate the analysis.

The sixth book on my list of five is Salih N. Neftci’s An Introduction to the Mathematics of Financial Derivatives, Second Edition. This was originally intended as a graduate textbook, but it’s readable by undergraduates and even traders looking to understand the basis of quantitative finance.

May 3, 2012
 

A few weeks ago, I volunteered for a Wikipedia editing hack session. In the course of the session, I browsed by the page on Computational Journalism. It’s quite sparse, and as a result, I decided to collect my thoughts on exactly what computational journalism is. I’m still collecting – for me, any discipline is defined by the practitioners, what they do and the tools they use. But I’ve collected a few articles and books that I think are a good place to start.

I’d recommend starting with this article: Computational Journalism: How computer scientists can empower journalists, democracy’s watchdogs, in the production of news in the public interest.

Key Insights

  • The public-interest journalism on which democracy depends is under enormous financial and technological pressure.
  • Computer scientists help journalists cope with these pressures by developing new interfaces, indexing algorithms, and data-extraction techniques.
  • For public-interest journalism to thrive, computer scientists and journalists must work together, with each learning elements of the other’s trade.

Although it’s quite brief, this article does a good job of defining the frontiers of computational journalism. In particular, the authors call out five “areas of opportunity”:

  1. Combining information from varied digital sources.
  2. Information extraction.
  3. Document exploration and redundancy.
  4. Audio and video indexing.
  5. Extracting data from forms and reports.

With that article as a basis, I’d recommend the recently-published Data Journalism Handbook. While I consider data journalism a proper subset of computational journalism, the concepts are better known, so it’s a good place to start the journey. It’s also quite comprehensive and down-to-earth. And it’s free.

Participatory Journalism is a collection of essays covering the emerging trends of social media, user-generated content and the new ways journalists interact with their audiences. It’s based mostly on interviews with journalists in major newspapers around the world. I found it rather wordy and overly long, but there really isn’t any other book I’ve found that describes these trends from the point of view of people actually working in newsrooms. And their challenges feed right back into the areas of opportunity called out in the ACM article.

Finally, for those of you looking for a foundation course in journalism, I highly recommend Andy Bull’s Multimedia Journalism. Once you buy the textbook, you get access to the entire web site. It’s been many years since I took a journalism course, and I found that reading the textbook was vital for me as a developer to understand what journalists need from the tools and the technologies.