Apr 05 2012
 

Computational Journalism Server: SUSE Gallery Download Page

Computational Journalism Server: Github Project

Data Journalism Developer Studio Users Google Group


I’ve just published release 0.2.1 of the Computational Journalism Server to the SUSE Gallery. If you’re interested in beta testing it, please join the Data Journalism Developer Studio Users Google Group.

The Computational Journalism Server is a spinoff / refactoring of the Data Journalism Developer Studio. As I noted last month, it makes no sense for me to maintain and re-distribute a Linux desktop and desktop tools when 80 percent of my users already have a perfectly good non-Linux desktop where they can run those tools! So the plan is to migrate the server-based software from the original appliance into the new server appliance and remove it from the desktop appliance.

In addition, the server appliance is going to evolve to function as a node in a grid / cluster / cloud infrastructure. I’m hoping to eventually package it as an OpenStack compute node. The server appliance will be focused on the R language, CRAN library packages and task views, and whatever Linux packages are required to support the R environment. There are plenty of other platforms out there for Rails, Spring, Node.js, Django, and so forth, but I haven’t seen anything specifically for people who want to develop in R.

The core appliance at the moment consists of the following components:

  • openSUSE 12.1 64-bit server base,
  • The ATLAS high-performance linear algebra library,
  • The R-patched distribution of R. This updates frequently and consists of patches on top of the most recent stable release,
  • The PostgreSQL and SQLite3 databases,
  • The Redis data structure store,
  • R web servers: rApache, Rook, websockets, and R Server Pages,
  • The RStudio Server IDE,
  • The Natural Language Processing, Reproducible Research and High Performance Computing task views.

I’ll be posting more documentation on getting started with the Computational Journalism Server in the next few days. I plan to add the Spatial task view in the next week but have no plans for any more task views in the near future. The enhancements and bug fixes I am working on include:

  • Packaging as an OpenStack compute node,
  • Rebuilding ATLAS and R-patched from source tuned to the server hardware,
  • Fixing some underlying dependency issues in the High Performance Computing task view,
  • OpenCL integration on NVIDIA hardware, and
  • Demos of the web server capabilities.
 Posted at 14:17
Jan 15 2012
 

Download Data Journalism Developer Studio 2012 From SUSE Gallery


Update 2012-01-15 – In curating a story on Sentiment Analysis and the 2012 Election, I discovered this blog post by Laurent Luce on Twitter sentiment analysis using Python and NLTK, the Natural Language Toolkit. The Python NLTK is not in the base appliance, but it can be installed using the following commands:

> cd /home/studio/Install-Scripts/Python-NLTK
> ./cleanup.bash
> ./install-dependencies.bash
> ./install.bash


Update 2012-01-14 – I added the ‘textir’, ‘tm.webmining’ and ‘tm.sentiment’ library packages to the base appliance in version 2.2.0. So there’s no need to install anything in the base appliance if you want to do sentiment analysis. A good overview of sentiment analysis can be found at Sentiment Analysis and Subjectivity.


If you’ve been following the history of the Data Journalism Developer Studio, you know that it evolved from three previous appliances. Those appliances have been discontinued, but the software in them for the most part lives on in the current one. I’ve been seeing quite a bit of search traffic to my blog coming from the “sentiment analysis” keyword, so I’m posting this mini-guide to getting started.

Sentiment analysis in Data Journalism Developer Studio 2012 is done using the textir R library package. Textir is a “set of tools for inference about text and associated speaker/document sentiment,” created by Matt Taddy, Assistant Professor of Econometrics and Statistics and Robert L. Graves Faculty Fellow at the University of Chicago Booth School of Business.

If you’re interested in the mathematics behind this package, Professor Taddy has posted a paper to arXiv.org, titled “Inverse Regression for Analysis of Sentiment in Text.” The paper describes three sample problems and their solutions: ideology in political speeches, online restaurant reviews, and business news and stock performance. The political speech, restaurant review and business news datasets are included with the library. See also On Estimation and Selection for Topic Models.

The easiest way to get this package is to install it via RStudio. Start up RStudio and select the “Packages” tab in the lower right quadrant. Then press the “Install Packages” button. Type “textir” in the middle line on the form and press “Install”.
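
If you would rather work from a terminal than the RStudio dialog described above, the same install can probably be scripted. The following is only a sketch: the helper name install_r_package and the CRAN mirror URL are my own assumptions, not something shipped on the appliance. It relies on Rscript being on the PATH and on the base R function install.packages().

```shell
# Hypothetical helper (not part of the appliance): install a CRAN package
# from the shell. The default mirror URL is an assumption; pass another
# mirror as the second argument if you prefer.
install_r_package() {
  pkg="$1"
  repo="${2:-https://cran.r-project.org}"

  # Refuse to run without a package name.
  if [ -z "$pkg" ]; then
    echo "usage: install_r_package PACKAGE [CRAN_MIRROR]" >&2
    return 2
  fi

  # Rscript ships with R; bail out early if it is not available.
  if ! command -v Rscript >/dev/null 2>&1; then
    echo "Rscript not found on PATH" >&2
    return 1
  fi

  # install.packages() is base R's installer for CRAN packages.
  Rscript -e "install.packages('$pkg', repos = '$repo')"
}
```

After sourcing the function, running install_r_package textir should fetch the package and its dependencies. Depending on how R was installed, you may need write access to the R library directory.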

 Posted at 16:40
Jan 15 2012
 

Data Journalism Developer Studio 2012LX Blog


Last week, two stories broke about vendors showing off their sentiment analysis tools on social media messages about the 2012 election. The “smaller” story is about Twitter “predicting” the results of the New Hampshire primary. The “larger” story is about Facebook making a deal with Politico to share public and private data about the GOP candidates.

As you can imagine, this topic is of extreme interest to me, and I’ve taken two steps in researching this story.

  1. I’ve put the CRAN sentiment analysis library packages ‘textir’ and ‘tm.sentiment’ into the base Data Journalism Developer Studio 2012 appliance, so you can experiment with this in the comfort and safety of your own home, without having to buy any software.
  2. I’ve started curating news and technology articles on the topic at Scoop.it: Sentiment Analysis and the 2012 Election.

I’m not sure how long this is going to be an active news story. The ACLU has weighed in on the Facebook – Politico deal, but in the larger context of SOPA, it may get lost in the shuffle.

Research papers on sentiment analysis:

 Posted at 11:29
Dec 31 2011
 

Data Journalism Developer Studio 2012 Overview

Download Data Journalism Developer Studio 2012 From SUSE Gallery

Data Journalism Developer Studio 2012 on Github

Data Journalism Developer Studio 2012 Blog


I’ve just released Data Journalism Developer Studio 2012. This is a major refactoring of the code base. The major user-visible changes are:

  1. I’ve removed RStudio Server for the time being. It was redundant for most users, and removing it freed up over 100 MB on the released appliances. I do plan to put an installer script for it on the appliance at a later date.
  2. Given the availability of a big chunk of space, I was able to move some frequently-used packages out of the options and into the released appliance. They are:
    1. The R Commander GUI. This turns R into a spreadsheet-like user interface. I’ve included the Text Mining plugin as well.
    2. Google Refine. This is another spreadsheet-like tool for working with messy data. The Tesseract Optical Character Recognition package is also included.
    3. Maqetta. This is a WYSIWYG HTML5 user interface builder based on the Dojo JavaScript libraries.
    4. The Perl utilities are back in the main appliance.
  3. I’ve re-organized the install scripts slightly. The BARD re-districting mapping tool is now part of the Spatial task view, and the “beancounter” financial database tool is now part of the Finance task view.

There’s more coming in the next few weeks on the road map. I’ve been testing the Octopress lightweight blogging platform. It’s quite technical – it’s billed as a blogging platform for hackers, and that’s a pretty good description. It’s very lightweight, though, and it works with Github for painless deployment and version control. There will be a sample blog for the Data Journalism Developer Studio 2012 up on Github in a day or so.

Now that the Twitter Perl libraries are back in the main appliance, I’ll be putting my Twitter user and tweet CSV dump routines on the appliance. That way, you’ll be able to acquire tweets or user lists and process them from the appliance desktop.

 Posted at 23:11