Mar 21 2012
 

It is now Spring in the Northern Hemisphere, and our Anglo-American tradition calls for spring cleaning. So, in the spirit of the season:

  1. Fundry is closing down. So you will be unable to make donations or fund new features for the Data Journalism Developer Studio there. I don’t have any plans for an alternative revenue stream at this time.

  2. Speaking of revenue, I’m not making enough revenue from Powell’s, Amazon, Clicky, Viralheat or HootSuite affiliate links to justify the effort involved. Over the next few weeks I’ll be removing nearly all those links from this blog. I’m staying in the affiliate programs for the moment, but I don’t have the time to be an active full-time affiliate marketer, and that’s what’s required to make any money out of these programs.

  3. I am shutting down the separate blog for the Data Journalism Developer Studio on Github. Octopress is way too geeky and maintaining two blogs is diluting my focus. It was a noble experiment, but it has failed and it’s time to move on. I’ll move the important posts there over to this blog before I shut it down.

  4. I will most likely be moving this blog from a self-hosted WordPress site to a free blog hosted at WordPress.com. Again, the amount of time I’m spending in plug-in and theme maintenance isn’t justified by a payoff. If I were a professional WordPress developer, selling my services, that would be different. But I’m not and don’t plan to be.

And what of the future? I’m glad you asked!

I’m currently re-evaluating the strategy for the Data Journalism Developer Studio, based on the data I’m collecting on my visitors, an in-depth look at the tasks data journalists actually do and the questions they ask, and my own desire to get more involved in advanced open source technologies. I don’t have all the details nailed down just yet, but here’s a rough road map.

  1. The current appliance is a Linux desktop. When I look at my visitor data, 56.2% of my visitors use Windows, 23.9% use Mac OS X, 10.1% use Linux, and 9.7% are on mobile. I suspect that my Linux numbers are higher than those for most sites, since I’m a long-time Linux professional. But given these numbers, and the fact that all of the desktop tools packaged in the current appliance will run on a Windows or Macintosh desktop as well as a Linux desktop, it makes no sense for me to maintain and re-distribute a Linux desktop and desktop tools when 80 percent of my users already have a perfectly good non-Linux desktop where they can run those tools!

    So I will be freezing the current appliance as soon as I have all the documentation done on how to get the desktop tools if you’re on Windows or Mac OS X. If there’s enough demand for it, I’ll port the tool install scripts to Fedora and Debian/Ubuntu/Mint.

  2. I’ve had a number of requests for the Twitter research tools. You’d think people had better ways of capturing Twitter search data, contact lists and tweets from a user, but I keep getting these requests. So I’m considering packaging them as Windows executables using the ActiveState Perl Development Kit. I can also make Macintosh executables with that kit, but since I don’t own a Macintosh for testing, that’s way down on my priority list.

  3. I am creating a new appliance, tentatively called the Data Journalism Advanced Server. I’m prototyping it at present on SUSE Studio, but I’m not necessarily going to use that as the final delivery mechanism. As the name implies, it will be a server. I’m still evaluating the underlying technologies, but I’m pretty sure of some things:

    1. It will run a 64-bit Linux. Some of the underlying mathematical software is optimized for 64-bit processors, and if I include a virtualization hosting technology, that requires 64-bit processors and some extra flags.

    2. As always, it will be 100% open source.

    3. Much of the software will be built on-appliance from upstream source. My tentative strategy is to build a core openSUSE appliance on SUSE Studio with KVM as the virtualizer, and to build the actual server as a Gentoo guest.

    4. It will have R and RStudio Server, plus all of the LaTeX tools required to build R documentation in RStudio Server.

    5. It will have Node.js and Django for sure. It may or may not have Sinatra and Ruby on Rails; I think there are better ways to deploy those technologies than what I’m building.

    6. It will have all of the databases described in Seven Databases In Seven Weeks.
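
The visitor numbers in item 1 of the road map are easy to sanity-check. Here is a minimal Python sketch that adds up the shares quoted above; the dictionary and function names are just my own illustration, not anything shipped on the appliance.

```python
# Visitor shares quoted in the post, in percent.
visitor_share = {
    "Windows": 56.2,
    "Mac OS X": 23.9,
    "Linux": 10.1,
    "mobile": 9.7,
}


def combined_share(shares, platforms):
    """Sum the percentage shares for the named platforms, to one decimal."""
    return round(sum(shares[p] for p in platforms), 1)


# Desktop visitors who are not on Linux.
non_linux_desktop = combined_share(visitor_share, ["Windows", "Mac OS X"])

# All four categories together.
total = combined_share(visitor_share, list(visitor_share))

print(f"Non-Linux desktop visitors: {non_linux_desktop}%")
print(f"All categories combined: {total}%")
```

The Windows and Mac OS X shares add up to 80.1%, which is where the "80 percent" figure comes from. The four categories total 99.9% rather than 100%, presumably because the published figures are rounded.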

 Posted by at 16:41
Jan 15 2012
 

Download Data Journalism Developer Studio 2012LX From SUSE Gallery

Download Computational Journalism Server From SUSE Gallery


Update 2012-01-15 – In curating a story on Sentiment Analysis and the 2012 Election, I discovered this blog post by Laurent Luce on Twitter sentiment analysis using Python and NLTK, the natural language processing toolkit. The Python NLTK is not in the base appliance, but it can be installed using the following commands:

> cd /home/studio/Install-Scripts/Python-NLTK
> ./cleanup.bash
> ./install-dependencies.bash
> ./install-bash


Update 2012-01-14 – I added the ‘textir’, ‘tm.webmining’ and ‘tm.sentiment’ library packages to the base appliance in version 2.2.0. So there’s no need to install anything in the base appliance if you want to do sentiment analysis. A good overview of sentiment analysis can be found at Sentiment Analysis and Subjectivity.


If you’ve been following the history of the Data Journalism Developer Studio, you know that it evolved from three previous appliances. Those appliances have been discontinued, but the software in them for the most part lives on in the current one. I’ve been seeing quite a bit of search traffic to my blog coming from the “sentiment analysis” keyword, so I’m posting this mini-guide to getting started.

Sentiment analysis in Data Journalism Developer Studio 2012 is done using the textir R library package. Textir is a “set of tools for inference about text and associated speaker/document sentiment,” created by Assistant Professor of Econometrics and Statistics and Robert L. Graves Faculty Fellow Matt Taddy of the University of Chicago Booth School of Business.

If you’re interested in the mathematics behind this package, Professor Taddy has posted a document to arXiv.org, titled “Inverse Regression for Analysis of Sentiment in Text.” Three sample problems and their solutions are described in the paper: ideology in political speeches, on-line restaurant reviews, and business news and stock performance. The political speech, restaurant review and business news datasets are included with the library. See also On Estimation and Selection for Topic Models.

The easiest way to get this package is to install it via RStudio. Start up RStudio and select the “Packages” tab in the lower right quadrant. Then press the “Install Packages” button. Type “textir” in the middle line on the form and press “Install”.

 Posted by at 16:40
Jan 15 2012
 

Data Journalism Developer Studio 2012LX Blog


Last week, two stories broke about vendors showing off their sentiment analysis tools on social media messages about the 2012 election. The “smaller” story is about Twitter “predicting” the results of the New Hampshire primary. The “larger” story is about Facebook making a deal with Politico to share public and private data about the GOP candidates.

As you can imagine this topic is of extreme interest to me, and I’ve taken two steps in researching this story.

  1. I’ve put the CRAN sentiment analysis library packages ‘textir’ and ‘tm.sentiment’ into the base Data Journalism Developer Studio 2012 appliance, so you can experiment with this in the comfort and safety of your own home, without having to buy any software.
  2. I’ve started curating news and technology articles on the topic at Scoop.it: Sentiment Analysis and the 2012 Election.

I’m not sure how long this is going to be an active news story. The ACLU has weighed in on the Facebook – Politico deal, but in the larger context of SOPA, it may get lost in the shuffle.

Research papers on sentiment analysis:

 Posted by at 11:29
Dec 31 2011
 

Data Journalism Developer Studio 2012 Overview

Download Data Journalism Developer Studio 2012 From SUSE Gallery

Data Journalism Developer Studio 2012 on Github

Data Journalism Developer Studio 2012 Blog


I’ve just released Data Journalism Developer Studio 2012. This is a major refactoring of the code base. The major user-visible changes are:

  1. I’ve removed RStudio Server for the time being. It was redundant for most users, and removing it freed up over 100 MB on the released appliances. I do plan to put an installer script for it on the appliance at a later date.
  2. Given the availability of a big chunk of space, I was able to move some frequently-used packages out of the options and into the released appliance. They are:
    1. The R Commander GUI. This turns R into a spreadsheet-like user interface. I’ve included the Text Mining plugin as well.
    2. Google Refine. This is another spreadsheet-like tool for working with messy data. The Tesseract Optical Character Recognition package is also included.
    3. Maqetta. This is a WYSIWYG HTML5 user interface builder based on the Dojo JavaScript libraries.
    4. The Perl utilities are back in the main appliance.
  3. I’ve re-organized the install scripts slightly. The BARD re-districting mapping tool is now part of the Spatial task view, and the “beancounter” financial database tool is now part of the Finance task view.

There’s more coming in the next few weeks on the road map. I’ve been testing the Octopress lightweight blogging platform. It’s quite technical – it’s billed as a blogging platform for hackers, and that’s a pretty good description. It’s very lightweight, though, and it works with Github for painless deployment and version control. There will be a sample blog for the Data Journalism Developer Studio 2012 up on Github in a day or so.

Now that the Twitter Perl libraries are back in the main appliance, I’ll be putting my Twitter user and tweet CSV dump routines on the appliance. That way, you’ll be able to acquire tweets or user lists and process them from the appliance desktop.

 Posted by at 23:11
Dec 27 2011
 

 

I was on the webinar that introduced this book, along with thousands of others. Over 30,000 registered, and the count of people who attended was 10,899. If you had a Kindle, you could download this book for free. I did. In fact, if you’re an Amazon Prime member, you can still get the Kindle edition for free. Dan Zarrella has been collecting Twitter and Facebook data for some time now, and this book is the result of a careful study of what works, what doesn’t work, and what sometimes works. It’s an easy read and very useful.


 

Pulse is along the lines of the previous book, but is much more detailed, and talks about more data sources. Moreover, it’s research-oriented – how to do the sort of thing Dan Zarrella did to define his hierarchy of contagiousness.

Pulse covers Google search trends, Facebook connections, blogs and Twitter in some detail. The themes are “what we surf, who we friend and what we say.” Chapter 8 describes three “potential pulses”: “where we go, what we buy and how we play”. You should think of this book as an overview; you’ll need to dig deeper if you want to implement any of the research described.


 

One of the recurring themes in the past two years has been the so-called “lean startup”. I have some skepticism about the concept, particularly the way it’s been described in blogs. So it’s refreshing to see a book like Venture Deals come out that’s full of actual meat, not just admonitions to “fail faster.” The authors were here in Portland a few weeks ago for a well-attended lecture on the contents, and I have a signed copy. If you’re starting a business, you need to read this book first.


I’ve covered this book at some length here, so I’ll refer you to the previous review. While it’s very advanced technically, it’s the best book published this year on trading technology. Most of the year’s other trading books are rehashes of decades-old technical analysis methods that may or may not work any more. If you want to be a trader, I highly recommend getting this book.


We heard a lot about economic inequality this year, and we’ll hear a lot more in 2012. Whether it’s Occupy Wall Street, calls for higher taxes on the wealthy by Warren Buffett, or President Obama’s speech in Osawatomie, Kansas, economic inequality and how to reverse it have become a topic of interest.

Changing Inequality is an easy-to-read book on the subject. It traces the causes of economic inequality over the past three decades, and suggests a few possible ways to reverse the trend. It should be noted, however, that reversing a 30-year trend of rising inequality isn’t easy. For example, as noted in the book,

In the late 1980s, when it first became clear that rapid increases in inequality were more than a short-term or cyclical phenomenon, researchers began to look for causes. It was almost a decade before widespread consensus was reached among economists that these changes were largely driven by skill-biased increases in demand, many of them probably the result of technological changes linked to a growing use of computer technologies.

The challenge for policy-makers is to devise policies that promote both growth and equality. Changing Inequality is a good place to start.

 Posted by at 15:42