May 16, 2012
 

I’ve just released Computational Journalism Server 1.2.6. There are two major updates to the functionality.

  1. I’ve added a script to upgrade the server to a desktop. The “install-lxde.bash” script installs a full LXDE desktop. You’ll have the Firefox browser, the Claws email client, the AbiWord word processor, Gnumeric spreadsheet, the Leafpad text editor, the LXTerminal terminal emulator, ePDFViewer and a graphical file manager.
  2. Given a desktop install, I’ve added scripts to download and install the prototype Overview tool for semantic visualization and hierarchical clustering of large document sets. I’ve been experimenting with this for three days now and I think it belongs in every computational journalist’s tool set.

I wrote a bit about Overview two days ago. The Overview team has an ambitious road map and I’m pretty confident their approach will help working journalists make sense of the volumes of text available. The tool as documented on the Overview web site runs on Windows and Macintosh personal computers with no modifications. In addition to the ability to install Overview in a desktop-enhanced Computational Journalism Server, I’ve provided scripts to install it on any openSUSE 12.1, Fedora 17 or Ubuntu 12.04 desktop. With minor tweaks it should run on older versions or other Linux distributions.

Here’s a sneak peek at the documentation for the new features, derived from https://github.com/znmeb/Computational-Journalism-Server/blob/master/Overview/README.md

Running Overview on the Computational Journalism Server

What’s Overview?

Overview is a tool for semantic visualization and hierarchical clustering of large document sets. Jonathan Stray of the Associated Press leads the development, with funding provided by a Knight Foundation grant.

The main project page is at http://overview.ap.org/. It’s an open source project and its repositories are on Github at https://github.com/overview. And they have a Twitter account: @overviewproject.

If you want to run the prototype of Overview on a Windows or Macintosh personal computer, the instructions are here: Getting Started with the Overview Prototype. If you want to run Overview on the Computational Journalism Server or a Linux Desktop, read on!

Running the Overview Prototype on the Computational Journalism Server

  1. You’ll need to download and install the Computational Journalism Server first. I recommend doing the “install-all.bash” full install rather than just installing the base appliance.
  2. Next, you’ll need to install the LXDE desktop. As “root”, do

    # cd /opt/Computational-Journalism-Server
    # ./install-lxde.bash

    When the script asks if you want to trust the repository, answer “a” for “always”.

  3. After the LXDE desktop repositories, patterns and packages are installed, you’ll be sent to a YaST2 session to change the default run level. Enter “Expert Mode”. Tab to the “Set default runlevel after booting to:” field and select “5: Full multiuser with network and display manager”. Then tab to “OK” and press “Enter”.
  4. Reboot and log in on the console as the non-root user you created when you installed the appliance. Select “LXDE” in the “Desktop” pulldown menu. You should be in the LXDE desktop. The “install-lxde.bash” script installs a full LXDE desktop. You’ll have the Firefox browser, the Claws email client, the AbiWord word processor, Gnumeric spreadsheet, the Leafpad text editor, the LXTerminal terminal emulator, ePDFViewer and a graphical file manager.
  5. Open an “LXTerminal”. The menu button is in the lower left. Start the menu and select “System Tools -> LXTerminal”.
  6. In the terminal, type

    $ cd /opt/Computational-Journalism-Server
    $ cp -a Overview/ ~

    This creates a copy in your home directory where you can write.

  7. Type

    $ cd ~/Overview
    $ ./install-overview-openSUSE.bash

    This will take quite a while, because the script downloads Ruby 1.9.3 and compiles it from source.

  8. Type

    $ ./test-overview.bash

    This will run all the test cases. The “caracas” example takes quite a bit of time in the Ruby NLP step, but the others run fairly quickly on my 8 GB dual-core laptop.
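Because the test run can take a while, it may help to capture its output in a log file and see how long it took. Here is a minimal sketch of one way to do that. The helper name `run_logged` and the log file name are my own assumptions, not part of the Computational Journalism Server scripts.

```shell
#!/usr/bin/env bash
# Sketch only: run_logged is a hypothetical helper, not part of the repository.

# run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND, sends its stdout and stderr to LOGFILE, prints the exit
# status and elapsed seconds, and returns COMMAND's exit status.
run_logged() {
  local log="$1"
  shift
  local start end status
  start=$(date +%s)
  # Capture the command's status without tripping "set -e" in the caller.
  "$@" > "$log" 2>&1 && status=0 || status=$?
  end=$(date +%s)
  echo "exit status $status after $((end - start)) seconds; output in $log"
  return "$status"
}

# Usage, from ~/Overview (the log file name is an arbitrary choice):
#   run_logged overview-test.log ./test-overview.bash
```

Nothing is printed to the terminal while the command runs; you can follow progress from a second terminal with `tail -f` on the log file.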

Running the Overview Prototype on Linux Desktop / Laptop

If you already have a Linux desktop, do the following. I’ve tested this on openSUSE 12.1, Ubuntu 12.04 “Precise Pangolin” and Fedora 17 “Beefy Miracle”. It can probably be made to work on older Fedora or Ubuntu desktops with a little tweaking, but I’m not testing it on them. It will get tested on openSUSE 12.2 when the beta comes out.

  1. Install “git”.
  2. In some convenient directory where you have write access, type

    $ git clone http://github.com/znmeb/Computational-Journalism-Server
    $ cd Computational-Journalism-Server/Overview
    $ ./install-overview-<DISTRO>.bash

    where <DISTRO> is Fedora, Ubuntu or openSUSE. Do not run this as “root” – use an ordinary user account!

  3. That will take some time; as noted above, one of the steps is to download and recompile Ruby 1.9.3 from source. When it’s done, type “./test-overview.bash” as above.
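If you would rather not pick the `<DISTRO>` script by hand, the choice could be automated along the following lines. This is a sketch under assumptions: it assumes your distribution ships an `/etc/os-release` file with an `ID=` line, and the function names `overview_installer` and `refuse_root` are hypothetical helpers of mine, not part of the repository. The script names follow the `install-overview-<DISTRO>.bash` pattern described above.

```shell
#!/usr/bin/env bash
# Sketch only: both functions are hypothetical helpers, not repository code.
# Assumption: the distro provides an os-release file containing an ID= line.

# overview_installer [RELEASE_FILE]
# Prints the installer script name for the distro described by RELEASE_FILE
# (default /etc/os-release). Returns 1 if the file is unreadable or the
# distro is not one of Fedora, Ubuntu or openSUSE.
overview_installer() {
  local release_file="${1:-/etc/os-release}"
  local id
  [ -r "$release_file" ] || return 1
  # Take the first ID= line only and strip optional double quotes.
  id=$(sed -n 's/^ID=//p' "$release_file" | tr -d '"' | head -n 1)
  case "$id" in
    fedora)    echo "install-overview-Fedora.bash" ;;
    ubuntu)    echo "install-overview-Ubuntu.bash" ;;
    opensuse*) echo "install-overview-openSUSE.bash" ;;
    *)         return 1 ;;
  esac
}

# refuse_root [UID]
# Returns 1 with a message when UID (default: the current user) is root.
refuse_root() {
  local uid="${1:-$(id -u)}"
  if [ "$uid" -eq 0 ]; then
    echo "Do not run this as root; use an ordinary user account." >&2
    return 1
  fi
}

# Usage, from Computational-Journalism-Server/Overview:
#   refuse_root && script=$(overview_installer) && "./$script"
```

On a distribution the sketch does not recognize, `overview_installer` prints nothing and fails, so nothing is run by accident.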
May 10, 2012
 

Version 1.2.5 Released

I’ve just pushed all the buttons to release Computational Journalism Server 1.2.5. It’s mostly bug fixes and miscellaneous cleanup changes, but there is one new major option: Apache™ Hadoop™. Right now, all that’s there is a script to download and install the latest stable Hadoop from Apache and run the single-node test script. But it should be enough for developers to start testing the R Hadoop interface routines ‘HadoopStreaming’ and ‘hive’. See Parallel R for some sample code using ‘HadoopStreaming’ and ‘hive’.

To install Hadoop, do the following:

  1. Log into the server as “root”.
  2. Type “cd ~/Computational-Journalism-Server/Hadoop”.
  3. Type “./install-hadoop.bash”.
  4. Type “./test-hadoop.bash”.

This should install Hadoop and run the single-node test. For more information on configuring larger-scale Hadoop clusters, see the main documentation page at http://hadoop.apache.org/common/docs/r1.0.2/. The scripts in this release came from the Single Node Setup page.
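If you want to run the install and the test in one go, it is worth stopping as soon as a step fails, so the test never runs against a broken install. A minimal sketch follows. The helper name `run_in_order` is hypothetical and not part of the release; the directory and script names are the ones from the steps above.

```shell
#!/usr/bin/env bash
# Sketch only: run_in_order is a hypothetical helper, not part of the release.

# run_in_order DIR SCRIPT...
# Runs each SCRIPT from inside DIR, in the order given, and returns 1 as
# soon as one of them fails; later scripts are not run.
run_in_order() {
  local dir="$1"
  shift
  local script
  for script in "$@"; do
    echo "Running $script in $dir"
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$dir" && "./$script" ) || {
      echo "$script failed; stopping." >&2
      return 1
    }
  done
}

# Usage, logged in as root on the server:
#   run_in_order ~/Computational-Journalism-Server/Hadoop \
#       install-hadoop.bash test-hadoop.bash
```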

Road Map

As a few recent posts on this blog have noted, I’m planning to migrate the platform components of Computational Journalism Server to either CloudFreeStyle or OpenShift or both, to take advantage of their existing platform-level components and community support structures. I don’t have a good estimate of dates yet, but there will be at least one more release as an openSUSE appliance before there’s any OpenShift or CloudFreeStyle release. There will also be one more release of Data Journalism Developer Studio 2012LX to catch up to the openSUSE Build Service packages.

May 08, 2012
 

There are quite a few books out now on “data science”. I’ve picked out three that I think are the best place to start for computational journalists. First is Machine Learning for Hackers, by Drew Conway and John Myles White. The authors are frequent contributors to the #rstats hashtag; R is the “native language” for this book. Topics of interest to computational journalists include

  • Email processing using Bayesian classifiers to detect spam,
  • An analysis of “polarization” in roll call votes in the United States Senate, and
  • Building a Twitter follower recommendation system.

I’m in the final stages of releasing version 1.2.5 of the Computational Journalism Server, and one of the goals is for the examples in this book to run.

Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites, by Matthew A. Russell, uses the Python language rather than R. This might make it more accessible to computational journalists. The book includes

  • scripts for accessing Facebook, Twitter, LinkedIn and Google+ APIs,
  • an excellent explanation of the basics of statistical natural language processing,
  • tools for building HTML5 / JavaScript visualizations, and
  • tools for exploring microformats.

This is more of a workstation resource than a server resource, so I’d recommend downloading Data Journalism Developer Studio 2012LX to experiment with these tools.

Finally, for those of you working in the area of politics and social media, I highly recommend Social Network Analysis for Startups: Finding connections on the social web by Maksim Tsvetovat and Alexander Kouznetsov. Like the previous book, the “native language” here is Python. Some of the topics and tools are also covered in Mining the Social Web, but there’s more depth here. You’ll also want to review the free webinars from O’Reilly on the subjects in the book.

 

May 04, 2012
 

If you follow me on Twitter, you’ve probably seen my Scoop.it topic posts. I’ve been on Scoop.it since mid-January, and I recently signed up for the free trial of the “Pro” version. I’m planning to continue with the Pro version, which features analytics and up to ten topics; the free version only allows five topics and has no analytics.

What do I like about Scoop.it? First of all, it’s easy to set up a topic and start collecting articles. You simply enter search keywords and Scoop.it searches Twitter, YouTube, Digg, Google Blogs and Google News feeds for matches. You can also add any RSS feed, or a Twitter user, list or search stream, as a source. You can add a Facebook page or import an OPML file.

There’s a bit of art to selecting the keywords. Too specific gets you few articles and too general gets you too many. For example, for my topic “Social Media Analytics and US Politics“, I started with a specific story about Facebook sharing data with Politico. This was too specific; I later added keywords for “social media analytics” and political terms like “President” and “election.” This yielded far too many hits, so I dropped the political terms and filtered out the non-political stories by hand. Someone with more experience in search engine keyword analysis could probably do this much faster than I did.

Once the collection is in progress, as a curator, you receive a stream of potential “scoops”, presented with the newest entry first. You can either discard or accept each suggestion. If you accept it, the “scoop” is pushed onto the top of your topic, displacing older entries. There are numerous controls for resizing and repositioning your scoops on the page. There’s a comment feature, although I have yet to receive any comments on my scoops.

There’s a “Star” option that will push a scoop to the top left position. There’s a “re-scoop” button that allows you to take any scoop from one of your topics or anyone else’s topics and add it to your own. You can share individual scoops on Twitter, LinkedIn, Pinterest and StumbleUpon and whole topics on Twitter and Facebook. You can also share to Tumblr and WordPress blogs. And there’s a bookmarklet you can use to capture scoops while browsing.

I’m not going to say much about the analytics, because they aren’t part of the free service and I’ve only been using them for a week. I will say, though, that if you’re serious about the platform you’ll at least want to do the free trial and check them out for yourself. There’s also an API, and Google+ integration is coming.

Finally, I want to say a bit about the Scoop.it network. So far, it seems to be joyfully spam-free. I follow about 100 topics and have discovered some interesting people on Scoop.it that I probably would not have discovered by chance on Twitter. My most popular topic, “Computational and Data Journalism“, has about 35 followers. I don’t know how rapidly the network is growing, though. I’m planning to stay with it at least a few more months, and I invite everyone to come join me, even if it’s only at the free level.

May 03, 2012
 

A few weeks ago, I volunteered for a Wikipedia editing hack session. In the course of the session, I visited the page on Computational Journalism. It’s quite sparse, and as a result, I decided to collect my thoughts on exactly what computational journalism is. I’m still collecting – for me, any discipline is defined by the practitioners, what they do and the tools they use. But I’ve collected a few articles and books that I think are a good place to start.

I’d recommend starting with this article: Computational Journalism: How computer scientists can empower journalists, democracy’s watchdogs, in the production of news in the public interest.

Key Insights

  • The public-interest journalism on which democracy depends is under enormous financial and technological pressure.
  • Computer scientists help journalists cope with these pressures by developing new interfaces, indexing algorithms, and data-extraction techniques.
  • For public-interest journalism to thrive, computer scientists and journalists must work together, with each learning elements of the other’s trade.

Although it’s quite brief, this article defines well the frontiers of computational journalism. In particular, the authors call out five “areas of opportunity”:

  1. Combining information from varied digital sources.
  2. Information extraction.
  3. Document exploration and redundancy.
  4. Audio and video indexing.
  5. Extracting data from forms and reports.

With that article as a basis, I’d recommend the recently-published Data Journalism Handbook. While I consider data journalism a proper subset of computational journalism, the concepts are better known, so it’s a good place to start the journey. It’s also quite comprehensive and down-to-earth. And it’s free.

Participatory Journalism is a collection of essays covering the emerging trends of social media, user-generated content and the new ways journalists interact with their audiences. It’s based mostly on interviews with journalists in major newspapers around the world. I found it rather wordy and overly long, but there really isn’t any other book I’ve found that describes these trends from the point of view of people actually working in newsrooms. And their challenges feed right back into the areas of opportunity called out in the ACM article.

Finally, for those of you looking for a foundation course in journalism, I highly recommend Andy Bull’s Multimedia Journalism. Once you buy the textbook, you get access to the entire web site. It’s been many years since I took a journalism course, and I found that reading the textbook was vital for me as a developer to understand what journalists need from the tools and the technologies.