May 02 2012
 

As I’ve noted here, the Computational Journalism Server “wants to be a Platform-as-a-Service (PaaS) when it grows up.” In plotting the way forward to that goal, I’ve looked at three options:

  1. Remain on openSUSE / SUSE Studio and collect other open source tools to provide the additional services that would make the current server into a true PaaS.
  2. Start with Cloud Foundry or one of its derivatives and add the computational journalism tools.
  3. Start with the Red Hat OpenShift Origin PaaS and add the computational journalism tools.

Remain on openSUSE

If I remain on openSUSE, as noted above, I’d need to collect more tools to provide additional services. Many of these deal with the underlying infrastructure, and my target infrastructure is still OpenStack Essex. Moreover, the overall goal is still to provide an R-language tool set for dealing with large-scale computational journalism problems, using library packages in the Comprehensive R Archive Network (CRAN) that implement parallel, cluster and grid computing.

When I started the original Data Journalism Developer Studio project, the Platform as a Service concept was in its infancy. It has matured rapidly, though. Cloud Foundry and Red Hat OpenShift are both about a year old, and several derivatives of Cloud Foundry have already appeared.

Cloud Foundry runs on Ubuntu Linux and OpenShift on Red Hat / Fedora. There isn’t an equivalent packaged solution for openSUSE. I’d have to build that for the Computational Journalism Server to be a true PaaS. And that seems to me a diversion from the mission.

Cloud Foundry

Cloud Foundry is an open source project from VMware. Derivative projects include AppFog for PHP projects, PaaS.io for Haskell, Stackato for Perl and Python and a community fork called CloudFreeStyle. Of these, neither AppFog nor PaaS.io is relevant, since there’s little PHP or Haskell in the current or planned use cases for the Computational Journalism Server. So the options are Cloud Foundry itself, Stackato or CloudFreeStyle.

I’ve ruled out the base Cloud Foundry for two main reasons.

  1. Most of the frameworks, services and tools provided by Cloud Foundry are totally unfamiliar to me. I know enough Ruby to do simple scripting and enough Java to call class libraries, but I know virtually nothing about Ruby on Rails, and I know absolutely nothing about Spring, Scala, RabbitMQ or Eclipse. I’d have a steep learning curve on tools that aren’t relevant to the core of the Computational Journalism Server – R, SQL and NoSQL databases, Hadoop and to a lesser extent Perl and Python.
  2. For an open source project with a year’s history under its belt, the documentation is, in a word, abysmal. In particular, the tasks I need to accomplish to make Computational Journalism Server into a PaaS – primarily adding R parallel programming capabilities and application packages – are totally undocumented.

I’m a member of the CloudFreeStyle project. I joined because I wanted to learn how to do what’s in Cloud Foundry and how to enhance it. Because it’s a source-level fork of Cloud Foundry, it would be easy to add functionality and ignore the components I don’t need, at least in the beginning. The “glue logic” to talk to applications and to the cloud infrastructure is already there and should “just work”. But, like its parent, it has little documentation and I’d have to figure things out from the source.

Finally, there’s Stackato. Stackato is a very impressive product and ActiveState’s documentation, support and tool set are world-class. I’ve been a happy ActiveState Perl Development Kit user for years. If the Computational Journalism Server were a commercial product / business venture rather than an open source project, I’d go with Stackato. But the Computational Journalism Server isn’t there yet and may never be.

OpenShift Origin

OpenShift Origin, released on April 30, 2012, is an open source PaaS construction platform from Red Hat. I’ve spent about a day and a half browsing the documentation and I’m blown away by how comprehensive it is, especially for someone like me who wants to build a tool set from the ground up. The OpenShift Origin documentation is every bit as awesome as Cloud Foundry’s is abysmal.

The demos include a number of “LAMP stack oldies but goodies” – MediaWiki, WordPress, and Drupal. There’s also an OpenShift Origin LiveCD, based on Fedora 16, that turns any 64-bit Intel / AMD workstation, laptop or virtual machine into an OpenShift PaaS. With a few additional steps you can install OpenShift Origin permanently on a real or virtual machine.

The Way Forward

At the moment, I’m keeping three options open:

  1. Remain on openSUSE,
  2. CloudFreeStyle, and
  3. OpenShift Origin.

But I suspect the strength of the documentation will pull the project towards OpenShift Origin sooner rather than later. “Watch this space,” as the saying goes.

Apr 27 2012
 

Computational Journalism Server: SUSE Gallery Download Page

Computational Journalism Server: Github Project

Data Journalism Developer Studio Users Google Group


Computational Journalism Server 1.0.0 is now available in the SUSE Studio Gallery at http://j.mp.compjournoserver. It’s not a full Platform-as-a-Service yet, which is the eventual goal, but just about all the components I want in the appliance are there. I’ve put fairly detailed instructions on getting started on the SUSE Gallery download page and they’re duplicated on the Github project page. If you run into difficulties, please feel free to comment here, send me a tweet, post an issue on Github, make a comment on the appliance download page, or rant on the Google Group. No carrier pigeons, please – I’ll just cook them for dinner.

I’ve got fairly big plans for the evolution of this tool set. In rough calendar order but with no firm dates yet, they are:

  • Get the major use cases described in Parallel R working,
  • Get the R integration with the ATLAS libraries working by default,
  • Get the appliance working as an LXC (Linux Containers) guest,
  • Package the appliance so it can function as an OpenStack Essex compute node,
  • Port the workstation development / testing environment to Ubuntu 12.04 and Fedora 17, and
  • Port the server to CloudFreeStyle and OpenShift Platform-as-a-Service environments.

 

Apr 05 2012
 


I’ve just published release 0.2.1 of the Computational Journalism Server to the SUSE Gallery. If you’re interested in beta testing it, please join the Data Journalism Developer Studio Users Google Group.

The Computational Journalism Server is a spinoff / refactoring of the Data Journalism Developer Studio. As I noted last month, it makes no sense for me to maintain and re-distribute a Linux desktop and desktop tools when 80 percent of my users already have a perfectly good non-Linux desktop where they can run those tools! So the plan is to migrate the server-based software from the original appliance into the new server appliance and remove it from the desktop appliance.

In addition, the server appliance is going to evolve to function as a node in a grid / cluster / cloud infrastructure. I’m hoping to eventually package it as an OpenStack compute node. The server appliance will be focused on the R language, CRAN library packages and task views, and whatever Linux packages are required to support the R environment. There are plenty of other platforms out there for Rails, Spring, Node.js, Django, and so forth, but I haven’t seen anything specifically for people who want to develop in R.

The core appliance at the moment consists of the following components:

  • openSUSE 12.1 64 bit server base,
  • The ATLAS high-performance linear algebra library,
  • The R-patched distribution of R, which updates frequently and consists of patches on top of the most recent stable release,
  • The PostgreSQL and SQLite3 databases,
  • The Redis data structure store,
  • R web servers: rApache, Rook, websockets and R Server Pages,
  • The RStudio Server IDE,
  • The Natural Language Processing, Reproducible Research and High Performance Computing task views.
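
One quick way to confirm that the main components listed above are present on a running appliance is to check for their command-line entry points. The following is only a sketch: the helper name `report_missing` is made up for illustration, and the command names in the usage line (`R`, `psql`, `sqlite3`, `redis-server`, `rstudio-server`) are the usual ones for these packages, which I am assuming the appliance puts on the PATH.

```shell
#!/usr/bin/env bash
# Sketch only: report_missing is a hypothetical helper, not part of the appliance.
# It prints one line for every named command that is not found on the PATH.
report_missing() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "missing: $cmd"
  done
}

# Usage (command names are assumptions about how the appliance is packaged):
#   report_missing R psql sqlite3 redis-server rstudio-server
```

No output means every command named was found. The R task views are sets of R packages rather than commands, so they would need to be checked from inside R instead.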

I’ll be posting more documentation on getting started with the Computational Journalism Server in the next few days. I plan to add the Spatial task view in the next week but have no plans for any more task views in the near future. The enhancements / bug fixes I am working on include:

  • Packaging as an OpenStack compute node,
  • Rebuilding ATLAS and R-patched from source tuned to the server hardware,
  • Fixing some underlying dependency issues in the High Performance Computing task view,
  • OpenCL integration on NVIDIA hardware, and
  • Demos of the web server capabilities.
 Posted at 14:17
Jan 15 2012
 

Download Data Journalism Developer Studio 2012 From SUSE Gallery


Update 2012-01-15 – In curating a story on Sentiment Analysis and the 2012 Election, I discovered this blog post by Laurent Luce on Twitter sentiment analysis using Python and NLTK, the natural language processing toolkit. The Python NLTK is not in the base appliance, but it can be installed using the following commands:

> cd /home/studio/Install-Scripts/Python-NLTK
> ./cleanup.bash
> ./install-dependencies.bash
> ./install-bash


Update 2012-01-14 – I added the ‘textir’, ‘tm.webmining’ and ‘tm.sentiment’ library packages to the base appliance in version 2.2.0. So there’s no need to install anything in the base appliance if you want to do sentiment analysis. A good overview of sentiment analysis can be found at Sentiment Analysis and Subjectivity.


If you’ve been following the history of the Data Journalism Developer Studio, you know that it evolved from three previous appliances. Those appliances have been discontinued, but the software in them for the most part lives on in the current one. I’ve been seeing quite a bit of search traffic to my blog coming from the “sentiment analysis” keyword, so I’m posting this mini-guide to getting started.

Sentiment analysis in Data Journalism Developer Studio 2012 is done using the textir R library package. Textir is a “set of tools for inference about text and associated speaker/document sentiment,” created by Assistant Professor of Econometrics and Statistics and Robert L. Graves Faculty Fellow Matt Taddy of the University of Chicago Booth School of Business.

If you’re interested in the mathematics behind this package, Professor Taddy has posted a document to arXiv.org, titled “Inverse Regression for Analysis of Sentiment in Text.” Three sample problems and their solutions are described in the paper: ideology in political speeches, on-line restaurant reviews, and business news and stock performance. The political speech, restaurant review and business news datasets are included with the library. See also On Estimation and Selection for Topic Models.

The easiest way to get this package is to install it via RStudio. Start up RStudio and select the “Packages” tab in the lower right quadrant. Then press the “Install Packages” button. Type “textir” in the middle line on the form and press “Install”.
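
If you’d rather work from a terminal than from the RStudio form, the same install can be sketched as a short shell helper. This is an illustration, not something that ships with the appliance: the function names `install_expr` and `install_r_package` are made up, the CRAN mirror URL is my assumption, and the script assumes `Rscript` is on the PATH. It calls R’s standard `install.packages()` function.

```shell
#!/usr/bin/env bash
# Sketch only: the function names and the mirror URL are assumptions for illustration.
CRAN_MIRROR="https://cloud.r-project.org"

# Build the R expression that installs one package from the chosen mirror.
install_expr() {
  printf 'install.packages("%s", repos = "%s")\n' "$1" "$CRAN_MIRROR"
}

# Run the install when Rscript is available; otherwise just show what would be run.
install_r_package() {
  if command -v Rscript >/dev/null 2>&1; then
    Rscript -e "$(install_expr "$1")"
  else
    echo "Rscript not found; would run: $(install_expr "$1")"
  fi
}

# Usage:
#   install_r_package textir
```

Keeping the R expression in its own function makes it easy to print and inspect before anything is actually installed.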

 Posted at 16:40
Jan 15 2012
 

Data Journalism Developer Studio 2012LX Blog


Last week, two stories broke about vendors showing off their sentiment analysis tools on social media messages about the 2012 election. The “smaller” story is about Twitter “predicting” the results of the New Hampshire primary. The “larger” story is about Facebook making a deal with Politico to share public and private data about the GOP candidates.

As you can imagine, this topic is of extreme interest to me, and I’ve taken two steps in researching this story.

  1. I’ve put the CRAN sentiment analysis library packages ‘textir’ and ‘tm.sentiment’ into the base Data Journalism Developer Studio 2012 appliance, so you can experiment with this in the comfort and safety of your own home, without having to buy any software.
  2. I’ve started curating news and technology articles on the topic at Scoop.it: Sentiment Analysis and the 2012 Election.

I’m not sure how long this is going to be an active news story. The ACLU has weighed in on the Facebook – Politico deal, but in the larger context of SOPA, it may get lost in the shuffle.


 Posted at 11:29