
Data Journalism Developer Studio 2012LX

 

Download Data Journalism Developer Studio 2012 From SUSE Gallery


Update 2012-01-15 – In curating a story on Sentiment Analysis and the 2012 Election, I discovered this blog post by Laurent Luce on Twitter sentiment analysis using Python and NLTK, the natural language processing toolkit. The Python NLTK is not in the base appliance, but it can be installed using the following commands:

> cd /home/studio/Install-Scripts/Python-NLTK
> ./cleanup.bash
> ./install-dependencies.bash
> ./install-bash


Update 2012-01-14 – I added the ‘textir’, ‘tm.webmining’ and ‘tm.sentiment’ library packages to the base appliance in version 2.2.0. So there’s no need to install anything in the base appliance if you want to do sentiment analysis. A good overview of sentiment analysis can be found at Sentiment Analysis and Subjectivity.


If you’ve been following the history of the Data Journalism Developer Studio, you know that it evolved from three previous appliances. Those appliances have been discontinued, but the software in them for the most part lives on in the current one. I’ve been seeing quite a bit of search traffic to my blog coming from the “sentiment analysis” keyword, so I’m posting this mini-guide to getting started.

Sentiment analysis in Data Journalism Developer Studio 2012 is done using the textir R library package. Textir is a “set of tools for inference about text and associated speaker/document sentiment,” created by Assistant Professor of Econometrics and Statistics and Robert L. Graves Faculty Fellow Matt Taddy of the University of Chicago Booth School of Business.

If you’re interested in the mathematics behind this package, Professor Taddy has posted a paper to arXiv.org titled “Inverse Regression for Analysis of Sentiment in Text.” The paper describes three sample problems and their solutions: ideology in political speeches, online restaurant reviews, and business news and stock performance. The political speech, restaurant review and business news datasets are included with the library. See also On Estimation and Selection for Topic Models.

The easiest way to get this package is to install it via RStudio. Start RStudio and select the “Packages” tab in the lower right quadrant. Then press the “Install Packages” button, type “textir” in the middle line of the form and press “Install”.

 

Data Journalism Developer Studio 2012LX Blog


Last week, two stories broke about vendors showing off their sentiment analysis tools on social media messages about the 2012 election. The “smaller” story is about Twitter “predicting” the results of the New Hampshire primary. The “larger” story is about Facebook making a deal with Politico to share public and private data about the GOP candidates.

As you can imagine, this topic is of extreme interest to me, and I’ve taken two steps in researching this story.

  1. I’ve put the CRAN sentiment analysis library packages ‘textir’ and ‘tm.sentiment’ into the base Data Journalism Developer Studio 2012 appliance, so you can experiment with this in the comfort and safety of your own home, without having to buy any software.
  2. I’ve started curating news and technology articles on the topic at Scoop.it: Sentiment Analysis and the 2012 Election.

I’m not sure how long this is going to be an active news story. The ACLU has weighed in on the Facebook – Politico deal, but in the larger context of SOPA, it may get lost in the shuffle.

Research papers on sentiment analysis:

 

Data Journalism Developer Studio 2012 Overview

Data Journalism Developer Studio 2012 on Github

Data Journalism Developer Studio 2012 Blog


I’ve just released Data Journalism Developer Studio 2012. This is a major refactoring of the code base. The major user-visible changes are:

  1. I’ve removed RStudio Server for the time being. It was redundant for most users, and removing it freed up over 100 MB on the released appliances. I do plan to put an installer script for it on the appliance at a later date.
  2. Given the availability of a big chunk of space, I was able to move some frequently-used packages out of the options and into the released appliance. They are:
    1. The R Commander GUI. This turns R into a spreadsheet-like user interface. I’ve included the Text Mining plugin as well.
    2. Google Refine. This is another spreadsheet-like tool for working with messy data. The Tesseract Optical Character Recognition package is also included.
    3. Maqetta. This is a WYSIWYG HTML5 user interface builder based on the Dojo JavaScript libraries.
    4. The Perl utilities are back in the main appliance.
  3. I’ve re-organized the install scripts slightly. The BARD re-districting mapping tool is now part of the Spatial task view, and the “beancounter” financial database tool is now part of the Finance task view.

There’s more coming in the next few weeks on the road map. I’ve been testing the Octopress lightweight blogging platform. It’s quite technical – it’s billed as a blogging platform for hackers, and that’s a pretty good description. It’s very lightweight, though, and it works with Github for painless deployment and version control. There will be a sample blog for the Data Journalism Developer Studio 2012 up on Github in a day or so.

Now that the Twitter Perl libraries are back in the main appliance, I’ll be putting my Twitter user and tweet CSV dump routines on the appliance. That way, you’ll be able to acquire tweets or user lists and process them from the appliance desktop.

 


I was on the webinar that introduced this book, along with thousands of others. Over 30,000 people registered and 10,899 attended. If you had a Kindle, you could download this book for free. I did. In fact, if you’re an Amazon Prime member, you can still get the Kindle edition for free. Dan Zarrella has been collecting Twitter and Facebook data for some time now, and this book is the result of a careful study of what works, what doesn’t work, and what sometimes works. It’s an easy read and very useful.


Pulse is along the lines of the previous book, but is much more detailed, and talks about more data sources. Moreover, it’s research-oriented – how to do the sort of thing Dan Zarrella did to define his hierarchy of contagiousness.

Pulse covers Google search trends, Facebook connections, blogs and Twitter in some detail. The themes are “what we surf, who we friend and what we say.” Chapter 8 describes three “potential pulses”: “where we go, what we buy and how we play”. You should think of this book as an overview – you’ll need to dig deeper if you want to implement any of the research described.


One of the recurring themes in the past two years has been the so-called “lean startup”. I have some skepticism about the concept, particularly the way it’s been described in blogs. So it’s refreshing to see a book like Venture Deals come out that’s full of actual meat, not just admonitions to “fail faster.” The authors were here in Portland a few weeks ago for a well-attended lecture on the contents, and I have a signed copy. If you’re starting a business, you need to read this book first.


I’ve covered this book at some length here, so I’ll refer you to the previous review. While it’s very advanced technically, it’s the best book published this year on trading technology. Most of the year’s other trading books are rehashes of decades-old technical analysis methods that may or may not work any more. If you want to be a trader, I highly recommend getting this book.


We heard a lot about economic inequality this year, and we’ll hear a lot more in 2012. Whether it’s Occupy Wall Street, calls for higher taxes on the wealthy by Warren Buffett, or President Obama’s speech in Osawatomie, Kansas, economic inequality and how to reverse it have become a topic of interest.

Changing Inequality is an easy-to-read book on the subject. It traces the causes of economic inequality over the past three decades, and suggests a few possible ways to reverse the trend. It should be noted, however, that reversing a 30-year trend of rising inequality isn’t easy. For example, as noted in the book,

In the late 1980s, when it first became clear that rapid increases in inequality were more than a short-term or cyclical phenomenon, researchers began to look for causes. It was almost a decade before widespread consensus was reached among economists that these changes were largely driven by skill-biased increases in demand, many of them probably the result of technological changes linked to a growing use of computer technologies.

The challenge for policy-makers is to devise policies that promote both growth and equality. Changing Inequality is a good place to start.

 

I’ve collected some resources on income and wealth inequality. I’m just now digging into the mathematics, so this list will no doubt grow.




  • America’s Growing Income Gap, by the Numbers - ProPublica http://t.co/jYuKrNt5
  • Income Inequality Near You - ProPublica http://t.co/D8CpQBHQ
  • Changing Inequality by Rebecca M. Blank http://t.co/1u0hYCMJ
  • Wealth inequality in the United States - Wikipedia, the free encyclopedia http://t.co/3YIvAfEq
  • Economic inequality - Wikipedia, the free encyclopedia http://t.co/a2XINHIb
  • Income inequality metrics - Wikipedia, the free encyclopedia http://t.co/155c9cGd
  • Gini coefficient - Wikipedia, the free encyclopedia http://t.co/ygZqppcj
  • Lorenz curve - Wikipedia, the free encyclopedia http://t.co/E7OjIiwC
  • Pareto distribution - Wikipedia, the free encyclopedia http://t.co/pahxtTiK
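For readers who want to try the mathematics right away, here is a minimal Python sketch of two of the metrics in the list above: the Lorenz curve and the Gini coefficient. It assumes a plain list of non-negative incomes and uses the standard definition of the Gini coefficient as one minus twice the area under the Lorenz curve. The function names and the sample numbers are my own illustrations, not taken from any of the linked sources.

```python
from typing import List, Sequence, Tuple


def lorenz_curve(incomes: Sequence[float]) -> List[Tuple[float, float]]:
    """Return Lorenz curve points as (population share, income share)."""
    values = sorted(incomes)
    if not values or any(v < 0 for v in values):
        raise ValueError("incomes must be non-empty and non-negative")
    total = sum(values)
    if total == 0:
        raise ValueError("total income must be positive")
    n = len(values)
    points = [(0.0, 0.0)]
    running = 0.0
    # Walk from poorest to richest, accumulating the share of total income.
    for i, value in enumerate(values, start=1):
        running += value
        points.append((i / n, running / total))
    return points


def gini(incomes: Sequence[float]) -> float:
    """Gini coefficient: 1 minus twice the area under the Lorenz curve."""
    points = lorenz_curve(incomes)
    area = 0.0
    # Trapezoid rule between consecutive Lorenz points.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return 1 - 2 * area


if __name__ == "__main__":
    # Hypothetical incomes, purely for illustration.
    print(gini([1, 1, 1, 1]))   # everyone earns the same
    print(gini([0, 0, 0, 10]))  # one person earns everything
```

A value of 0 means perfect equality. With a finite sample of n people, this formula gives (n - 1) / n when one person holds all the income, so it approaches 1 only as the sample grows.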
 

 

Yet it seems to me that both inside the administration and outside of it there’s a shortage of turning to economists with specific expertise in recessions for advice on coping with the recession. These things are rare events. Buffett’s a smart guy and an old guy, but the USA has never been in this situation throughout the entirety of even his business career.

via The President Should Call Some Economists | ThinkProgress.

There are thousands of government economists in Washington and elsewhere, many of them noted PhDs. They report to the President via several Cabinet departments. They work in the Federal Reserve Board. They work for Congress as staffers or in the Congressional Budget Office. We pay their salaries via taxes, just like we pay our elected officials. In short, the President has “called some economists”. So have Congress and the Federal Reserve Board.

The media can and should hold these economists accountable, just as they do elected officials and corporate leaders like Buffett and Mulally. And it turns out that this is easy to do. The Departments of Treasury, Commerce and Labor all have excellent web sites with data, analysis and research papers. The Federal Reserve Board has an excellent web site with more data, analysis and research papers. So does the Congressional Budget Office. There is absolutely, positively no shortage of turning to economists with specific expertise in recessions for advice on coping with the recession.

Mr. Yglesias, it’s your job to advocate for the ThinkProgress agenda. But it’s the President’s job to synthesize the solutions from these thousands of economists and sell them to the American people. It also is his job to sell them to a collection of elected GOP local, state and Federal officials who want to see him voted out of office in November 2012. Recruiting industry leaders like Warren Buffett and Alan Mulally for advice, validation, salesmanship — whatever they can offer to help Washington restore sustainable economic growth — is a damn good idea.

 

About Data Journalism Developer Studio


In all the technology news last week, you might have missed this story. I only saw it mentioned on Reuters, not on any of the major technology blogs that I read. As is my usual practice when I see a technology story that matches my interests, I try to locate the original sources and post links on Twitter. So in case you missed those, here they are:

  • LinkedIn shares were a bubble: academic model | Reuters http://meb.tw/iNiM8R
  • Is There a Bubble in LinkedIn's Stock Price? http://meb.tw/loYBD3 [pdf]

There’s a fair amount of technical detail about the model in the paper cited in my second tweet. If you want even more, the model itself is documented here:

How to Detect an Asset Bubble by Robert Jarrow, Younes Kchia, Philip Protter :: SSRN http://meb.tw/iqvwUQ

So what’s the story here? From “Is There a Bubble in LinkedIn’s Stock Price?”:

It has been well documented in the financial press that a methodology is needed that can identify an asset price bubble in real time. William Dudley, the President of the New York Federal Reserve, in an interview with Planet Money [3] stated “…what I am proposing is that we try to identify bubbles in real time, try to develop tools to address those bubbles, try to use those tools when appropriate to limit the size of those bubbles and, therefore, try to limit the damage when those bubbles burst.”

It is also widely recognized that this is not an easy task. Indeed, in 2009 the Federal Reserve Chairman Ben Bernanke said in Congressional Testimony [1] “It is extraordinarily difficult in real time to know if an asset price is appropriate or not”.

Here’s a link to the William Dudley interview, and one to Bernanke’s testimony.

Professor Jarrow and his colleagues took up the challenge laid down by the Federal Reserve Board. The model they have devised is quite complex, involving stochastic differential equations and reproducing kernel Hilbert spaces. They tested this model on stock price data from “the alleged internet dotcom bubble (and beyond), from 1999 to 2005.” While there will no doubt be much more peer review of the data, model and conclusions, the test shows promise. Moreover, it can be applied to the price of any publicly-traded stock. The test has three possible results:

  1. There’s definitely a bubble.
  2. There’s definitely not a bubble.
  3. No conclusion about a bubble can be drawn from the data.
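To give a feel for how a test can have three outcomes, here is a deliberately simplified Python sketch. As I understand the published model, the price follows dS = σ(S) dW under the risk-neutral measure, and a bubble exists exactly when the integral of x / σ(x)² out to infinity is finite. For a hypothetical power-law volatility σ(x) = c·x^α, that integral is finite only when α > 1. The sketch assumes you already have a lower and upper bound on α from some estimation step. That interval is my own stand-in for estimation uncertainty and is not the reproducing kernel Hilbert space extrapolation the authors actually use.

```python
from enum import Enum


class Verdict(Enum):
    BUBBLE = "bubble"
    NO_BUBBLE = "no bubble"
    INCONCLUSIVE = "inconclusive"


def classify(alpha_low: float, alpha_high: float) -> Verdict:
    """Classify a power-law volatility sigma(x) = c * x**alpha.

    The integrand x / sigma(x)**2 behaves like x**(1 - 2*alpha), so the
    tail integral converges (bubble) only when alpha > 1.
    alpha_low and alpha_high are hypothetical bounds on the estimate.
    """
    if alpha_low > alpha_high:
        raise ValueError("alpha_low must not exceed alpha_high")
    if alpha_low > 1.0:
        return Verdict.BUBBLE        # whole interval implies convergence
    if alpha_high <= 1.0:
        return Verdict.NO_BUBBLE     # whole interval implies divergence
    return Verdict.INCONCLUSIVE      # interval straddles the threshold


if __name__ == "__main__":
    # Hypothetical estimates, purely for illustration.
    for low, high in [(1.2, 1.5), (0.4, 0.9), (0.8, 1.3)]:
        print(low, high, classify(low, high).value)
```

Note that α = 1 corresponds to geometric Brownian motion, which lands on the “no bubble” side of the threshold.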

So now we come to LinkedIn. LinkedIn was publicly traded for the first time on May 19, 2011, using the symbol LNKD. Professor Jarrow and his colleagues obtained real-time price data from Bloomberg for the first four days of trading and applied their model. And their claim is quite definitive:

We have found, definitively, that there is a price bubble!

While the technology is certainly interesting in its own right, at least to data journalists like myself, what are the wider implications of this? First of all, the context of the Dudley interview was the Finance / Insurance / Real Estate (FIRE) sector and the holdings of the Federal Reserve Board in that industry. As we all know, the Great Recession we discuss on a daily basis originated in the FIRE sector.

The context of the model Jarrow et al. have created, on the other hand, is publicly-traded stocks. In particular, the model was initially tested on Internet stocks during a well-documented bubble, and applied to a social media stock within days of its initial public offering. Moreover, the model should work in real time. Given a live data feed and enough computing capacity, it should be possible to monitor data and make investment decisions in real time.

Even though the model is designed for real-time publicly-traded stocks, it should be applicable to any financial time series that satisfies the underlying mathematical assumptions. This includes, for example, prices of shares in the “secondary markets” for companies like Facebook and Twitter. I haven’t attempted to implement the model yet – I’ve been away from computational finance for several years and I’m in the process of coming back up to speed on the methodologies. The core technologies are available in the Data Journalism Developer Studio, however, and if anyone is interested in working on this, send me a tweet @znmeb.

 

I’ve just pushed release 1.0.0 of the Data Journalism Developer Studio into the SUSE Gallery. Changes:

  • The base appliance ships with Mozilla Firefox as the browser rather than Chromium. Chromium is available as an add-on installation script set. This was a difficult decision for me to make, but the version of Chromium in the Open Build Service is 13.0.xxx, which is updated frequently and can be unstable. This is roughly equivalent to Google’s “Canary” build on Windows and Macintosh. Chromium was proving too unstable for regular use, so I replaced it with Firefox.
  • I added CoffeeScript to the install scripts for node.js and NowJS. If you’re a JavaScript developer, I welcome more suggestions for node.js packages.

I’m planning to open the project up to other developers in the near future. Now that the Fundry feature request mechanism is in place, the road map is public. My own plan is to start building user-level documentation. Most of the software in the appliance is well-documented on its own, but there aren’t too many examples of application-level usage that I’ve been able to find.


 

I’m really conflicted about this. On the one hand, I know Twitter needs to sell advertising, and web services need to promote themselves. And yes, this is a real news event, not a manufactured story. But I wonder – are we heading back to the days of “Yellow Journalism” in the tweet stream? Please comment below.

© 2011 Borasky Research Journal