May 15 2012
 

Surely you’ve seen this: I got one of those emails. As far as I’m concerned it’s spam and I’ll be disabling it. It is not:

  • Timely – what good is a summary of last week’s tweets?
  • Relevant – I don’t see any evidence that it tracks the topics I care about. And, of course, if I did care about a topic, I’d be tracking it on Twitter via search, in my RSS feed reader, via search engines, on email lists and even in face-to-face meetings.
  • Personal – It’s a damn email autoresponder, fercryingoutloud! Sure, it knows my name and Twitter handle, just like every other email autoresponder I’ve ever joined.

I think this is a giant leap backwards for Twitter. Email marketing represents everything a lot of us hate about the Internet. It’s annoying and for the most part a waste of the senders’ time as well as the receivers’. I’m on probably a dozen or two email lists / Google Groups relevant to my interests. But I rarely give out my email address any more to, say, download a “free white paper” or some other “content marketing” gizmo.

Twitter has an email list of hundreds of millions of addresses. How long do you suppose it will take phishers to copy the emails, hook up databases of Twitter handles and email addresses and start pumping out fake “best of Twitter” emails? How long do you suppose it will be before advertisers want “Promoted Stories” sent out to this mailing list? And if you’re using Gmail to read these emails, well, Google is making advertising dollars on Twitter’s back! What’s up with that?

I haven’t seen many complaints about this so far in the tech blogs. I think the focus is on Facebook’s IPO and Yahoo’s attempting to fire its way to growth. But I think it’s a bad idea, and I’ve unsubscribed.

If I were working at Twitter, I’d go the exact opposite way. Instead of building a “weekly news magazine”, I’d build a breaking real-time world news ticker. Give me a page with a map of the world. Capture an average of tweet rates by geotag, time of day and day of the week in a database. When the tweet rate takes a sharp increase at a location, light it up on the map and give me a link to search for the tweets.
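Here is a minimal sketch of how that ticker's spike detection could work, assuming tweets have already been counted per hour and per location. Everything in it is my own illustration, not anything Twitter provides: the class name `TweetRateBaseline`, the one-degree grid cells, and the "three standard deviations above average" rule are all assumptions. A real system would need a proper database and a more careful statistical model.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


class TweetRateBaseline:
    """Hypothetical baseline of tweet counts per (grid cell, weekday, hour)."""

    def __init__(self, cell_size=1.0):
        # cell_size is in degrees; 1.0 is an arbitrary choice for illustration.
        self.cell_size = cell_size
        self.history = defaultdict(list)

    def key(self, lat, lon, when):
        # Bucket the geotag into a grid cell, then add day of week and hour.
        cell = (int(lat // self.cell_size), int(lon // self.cell_size))
        return (cell, when.weekday(), when.hour)

    def record(self, lat, lon, when, count):
        """Store one observed hourly tweet count for this place and time slot."""
        self.history[self.key(lat, lon, when)].append(count)

    def is_spike(self, lat, lon, when, count, threshold=3.0, min_samples=4):
        """Return True if count is far above the historical average for the slot."""
        past = self.history.get(self.key(lat, lon, when), [])
        if len(past) < min_samples:
            return False  # not enough history to judge
        average = mean(past)
        spread = pstdev(past) or 1.0  # avoid dividing by zero on flat history
        return (count - average) / spread > threshold
```

To use it, call `record()` for each past hour you have counts for, then call `is_spike()` with the current hour's count. Any cell that returns True is one to light up on the map.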

May 14 2012
 

If you’ve been following my tweet stream, you saw me tweet this:

At $1450 a month for five seats, I think the service is overpriced. Moreover, Twitter, Facebook/Instagram, Google/YouTube or Yahoo/Flickr could easily build this into their web sites and deliver it for free, essentially bypassing two middlemen – Geofeedia and the news organization subscribing to Geofeedia. And a clever RSS / Yahoo! Pipes hacker could build something like this for use in a newsroom. For that matter, if you limit yourself to Twitter you can do most of this with Twitter / Advanced Search.

I must admit that I love the idea and think this could evolve into something game-changing. I wrote about the potential for this back in January 2010!

The Twitter Streaming API — How It Works and Why It’s A Big Deal

To get an idea what this could become, check out Knowledge Discovery from Data Streams by Joao Gama.

Moving on, I don’t know how I’ve managed to be a tech blogger writing about computational journalism without discovering Overview until last week, but it happened. Twitter serendipity at work – I was watching my Interactions page and saw a tweet of mine retweeted by @overviewproject. The Overview project is led by Jonathan Stray. You can see the entire team here.

Overview is open source, lives on GitHub and appears to be a mix of Ruby and Java. I’m currently testing it out for potential inclusion in one of my computational journalism appliances. It’s a browser / desktop application, so most likely it will end up in the successor to Data Journalism Developer Studio 2012LX. If you want to work with it yourself, the instructions are here.

So which of the two represents the future of journalism? Both, of course! With the proper underlying database and real-time knowledge discovery algorithms, Geofeedia could be a game-changer. But in the long run, as a for-profit service, I think they’ll either get acquired or duplicated by the big players. The Overview project, on the other hand, is an open source project. It’s well-funded by the Knight Foundation and Associated Press, and the team is led by one of the well-known names in computational journalism. Overview is certainly going to be part of my future.

 

May 09 2012
 

Updated 2012-05-14 13:28 PDT – Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement has gone to print. Congratulations to Eric Redmond and Jim R. Wilson!


As I’ve noted before, I believe any discipline is defined by its practitioners and the tools they use, and computational journalism is no exception. I think computational journalism is a superset of data journalism, but certainly the terms “data journalism” and “data science” cover the bulk of the tools I have collected and used.

In any case, to do computational journalism, at least some data must be collected, stored, explored, analyzed, cleaned, managed and “governed.” In the past few years, the “traditional” tools for doing this, called relational database management systems (RDBMS), have been supplemented by a new class of tools broadly known as “NoSQL” databases. The name NoSQL comes from the most widely used language for dealing with a traditional RDBMS, SQL.

The NoSQL field is rapidly evolving, but enough knowledge exists to fill several books. The best overview of databases for computational journalists I’ve found comes from a soon-to-be-released work from Eric Redmond and Jim R. Wilson called Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement.

I’ve been working through the book, which has been available for a few months in beta from the publisher, Pragmatic Programmers, LLC, in the course of collecting the tools for Data Journalism Developer Studio 2012LX and Computational Journalism Server. My goal is to have all of the databases available in both appliances, although at the moment only PostgreSQL, MongoDB, CouchDB and Redis are available directly from SUSE Studio.

Seven Databases in Seven Weeks covers, in order:

  • PostgreSQL, a traditional RDBMS,
  • Riak, a key-value database
  • HBase, a columnar database
  • MongoDB, a document-oriented database
  • CouchDB, a document-oriented database,
  • Neo4j, a graph-oriented database, and
  • Redis, a key-value database / data structure server.

All of these databases are open source, and they’re all supported by either a corporate entity, a non-profit foundation, or some combination of the two. The title really should have been “Seven Databases in Seven Weekends”; each database is covered in three-day hands-on sessions and could easily be done as a series of weekend projects. The book is hands-on – you’ll build things with these databases, including a Node.js application combining Redis, CouchDB and Neo4j into an application that provides a “band information service.”

Appendix A contains a pair of tables that give an overview of the distinguishing characteristics of the seven databases. As the authors put it, “Although the tables are not a replacement for a true understanding, they should provide you with an at-a-glance sense of what each database is capable of, where it falls short, and how it fits into the modern database landscape.”

I believe all of these databases have a place in modern computational journalism, as do the other two well-known open source RDBMS tools, MySQL and SQLite. In particular, for spatial / mapping projects, PostgreSQL, SQLite, MongoDB and CouchDB have robust geographic information systems capabilities either built in or available as add-ons.

Riak, HBase, MongoDB and CouchDB all support “big data” applications implemented via MapReduce. MongoDB and CouchDB both store their documents as JavaScript Object Notation (JSON) objects, which is the “native” format for Twitter data. Neo4j, as a graph database, is perfect for storing data about relationships, such as the interconnections between corporate executives and legislators. And because of its speed, Redis can serve as high-speed pipelines between other components in almost any application architecture.
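To make the MapReduce-over-JSON idea concrete, here is a toy sketch in plain Python. It does not use any of the databases above or their APIs. It only shows the shape of the technique: a map step that emits key/value pairs from each JSON document, and a reduce step that combines the values for each key. The sample documents and the `hashtags` field are hypothetical and much simpler than real Twitter data.

```python
import json
from collections import defaultdict


def map_hashtags(doc):
    """Map step: emit a (hashtag, 1) pair for every hashtag in one document."""
    for tag in doc.get("hashtags", []):
        yield tag.lower(), 1


def reduce_counts(key, values):
    """Reduce step: combine all the values emitted for one key."""
    return sum(values)


def map_reduce(docs, mapper, reducer):
    """Group mapper output by key, then reduce each group to a single value."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical, simplified tweet-like documents stored as JSON strings.
RAW = [
    '{"user": "alice", "hashtags": ["ddj", "NoSQL"]}',
    '{"user": "bob", "hashtags": ["nosql"]}',
    '{"user": "carol", "hashtags": ["ddj", "R"]}',
]

docs = [json.loads(line) for line in RAW]
counts = map_reduce(docs, map_hashtags, reduce_counts)
```

In a real deployment the database distributes the map and reduce steps across nodes; this single-process version only illustrates the data flow.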

I think NoSQL databases will be the core of computational journalism for the next few years. The RDBMS isn’t going away, of course, but if you limit yourself to “SQL thinking” or even “object-relational models” and “model-view-controller” architectures, there will be applications you can’t build. This book will get you up to speed as fast as you’re willing to go.

Apr 05 2012
 

Computational Journalism Server: SUSE Gallery Download Page

Computational Journalism Server: Github Project

Data Journalism Developer Studio Users Google Group


I’ve just published release 0.2.1 of the Computational Journalism Server to the SUSE Gallery. If you’re interested in beta testing it, please join the Data Journalism Developer Studio Users Google Group.

The Computational Journalism Server is a spinoff / refactoring of the Data Journalism Developer Studio. As I noted last month, it makes no sense for me to maintain and re-distribute a Linux desktop and desktop tools when 80 percent of my users already have a perfectly good non-Linux desktop where they can run those tools! So the plan is to migrate the server-based software from the original appliance into the new server appliance and remove it from the desktop appliance.

In addition, the server appliance is going to evolve to function as a node in a grid / cluster / cloud infrastructure. I’m hoping to eventually package it as an OpenStack compute node. The server appliance will be focused on the R language, CRAN library packages and task views, and whatever Linux packages are required to support the R environment. There are plenty of other platforms out there for Rails, Spring, Node.js, Django, and so forth, but I haven’t seen anything specifically for people who want to develop in R.

The core appliance at the moment consists of the following components:

  • openSUSE 12.1 64-bit server base,
  • The ATLAS high-performance linear algebra library,
  • The R-patched distribution of R. This updates frequently and consists of patches on top of the most recent stable release,
  • The PostgreSQL and SQLite3 databases,
  • The Redis data structure store,
  • R web servers: rApache, Rook, websockets and R Server Pages,
  • The RStudio Server IDE,
  • The Natural Language Processing, Reproducible Research and High Performance Computing task views.

I’ll be posting more documentation on getting started with the Computational Journalism Server in the next few days. I plan to add the Spatial task view in the next week but have no plans for any more task views in the near future. The enhancements / bug fixes I am working on include

  • Packaging as an OpenStack compute node,
  • Rebuilding ATLAS and R-patched from source tuned to the server hardware,
  • Fixing some underlying dependency issues in the High Performance Computing task view,
  • OpenCL integration on NVIDIA hardware, and
  • Demos of the web server capabilities.
 Posted by at 14:17