
Data Journalism Developer Studio 2012LX

 

Data Journalism Developer Studio 2012 Overview

Download Data Journalism Developer Studio 2012 From SUSE Gallery

Data Journalism Developer Studio 2012 on Github

Data Journalism Developer Studio 2012 Blog


I’ve just released Data Journalism Developer Studio 2012. This is a major refactoring of the code base. The major user-visible changes are:

  1. I’ve removed RStudio Server for the time being. It was redundant for most users, and removing it freed up over 100 MB on the released appliances. I do plan to put an installer script for it on the appliance at a later date.
  2. Given the availability of a big chunk of space, I was able to move some frequently-used packages out of the options and into the released appliance. They are
    1. The R Commander GUI. This turns R into a spreadsheet-like user interface. I’ve included the Text Mining plugin as well.
    2. Google Refine. This is another spreadsheet-like tool for working with messy data. The Tesseract Optical Character Recognition package is also included.
    3. Maqetta. This is a WYSIWYG HTML5 user interface builder based on the Dojo JavaScript libraries.
    4. The Perl utilities are back in the main appliance.
  3. I’ve re-organized the install scripts slightly. The BARD re-districting mapping tool is now part of the Spatial task view, and the “beancounter” financial database tool is now part of the Finance task view.

There’s more coming in the next few weeks on the road map. I’ve been testing the Octopress lightweight blogging platform. It’s quite technical – it’s billed as a blogging platform for hackers, and that’s a pretty good description. It’s very lightweight, though, and it works with Github for painless deployment and version control. There will be a sample blog for the Data Journalism Developer Studio 2012 up on Github in a day or so.

Now that the Twitter Perl libraries are back in the main appliance, I’ll be putting my Twitter user and tweet CSV dump routines on the appliance. That way, you’ll be able to acquire tweets or user lists and process them from the appliance desktop.

 

 

About Data Journalism Developer Studio


In all the technology news last week, you might have missed this story. I only saw it mentioned on Reuters, not on any of the major technology blogs that I read. As is my usual practice when I see a technology story that matches my interests, I located the original sources and posted links on Twitter. In case you missed those, here they are:

LinkedIn shares were a bubble: academic model | Reuters http://meb.tw/iNiM8R
@znmeb
M. Edward Borasky
Is There a Bubble in LinkedIn's Stock Price? http://meb.tw/loYBD3 [pdf]
@znmeb
M. Edward Borasky

There’s a fair amount of technical detail about the model in the paper cited in my second tweet. If you want even more, the model itself is documented here:

How to Detect an Asset Bubble by Robert Jarrow, Younes Kchia, Philip Protter :: SSRN http://meb.tw/iqvwUQ

So what’s the story here? From “Is There a Bubble in LinkedIn’s Stock Price?”:

It has been well documented in the financial press that a methodology is needed that can identify an asset price bubble in real time. William Dudley, the President of the New York Federal Reserve, in an interview with Planet Money [3] stated “…what I am proposing is that we try to identify bubbles in real time, try to develop tools to address those bubbles, try to use those tools when appropriate to limit the size of those bubbles and, therefore, try to limit the damage when those bubbles burst.”

It is also widely recognized that this is not an easy task. Indeed, in 2009 the Federal Reserve Chairman Ben Bernanke said in Congressional Testimony [1] “It is extraordinarily difficult in real time to know if an asset price is appropriate or not”.

Here’s a link to the William Dudley interview, and one to Bernanke’s testimony.

Professor Jarrow and his colleagues took up the challenge laid down by the Federal Reserve Board. The model they have devised is quite complex, involving stochastic differential equations and reproducing kernel Hilbert spaces. They tested this model on stock price data from “the alleged internet dotcom bubble (and beyond), from 1999 to 2005.” While there will no doubt be much more peer review of the data, model and conclusions, the test shows promise. Moreover, it can be applied to the price of any publicly-traded stock. The test has three possible results:

  1. There’s definitely a bubble.
  2. There’s definitely not a bubble.
  3. No conclusion about a bubble can be drawn from the data.

So now we come to LinkedIn. LinkedIn was publicly traded for the first time on May 19, 2011, using the symbol LNKD. Professor Jarrow and his colleagues obtained real-time price data from Bloomberg for the first four days of trading and applied their model. And their claim is quite definitive:

We have found, definitively, that there is a price bubble!

While the technology is certainly interesting in its own right, at least to data journalists like myself, what are the wider implications of this? First of all, the context of the Dudley interview was the Finance / Insurance / Real Estate (FIRE) sector and the holdings of the Federal Reserve Board in that industry. As we all know, the Great Recession we discuss on a daily basis originated in the FIRE sector.

The context of the model Jarrow et al. have created, on the other hand, is publicly-traded stocks. In particular, the model was initially tested on Internet stocks during a well-documented bubble, and applied to a social media stock within days of its initial public offering. Moreover, the model should work in real time. Given a live data feed and enough computing capacity, it should be possible to monitor data and make investment decisions in real time.

Even though the model is designed for real-time publicly-traded stocks, it should be applicable to any financial time series that satisfies the underlying mathematical assumptions. This includes, for example, prices of shares in the “secondary markets” for companies like Facebook and Twitter. I haven’t attempted to implement the model yet – I’ve been away from computational finance for several years and I’m in the process of coming back up to speed on the methodologies. The core technologies are available in the Data Journalism Developer Studio, however, and if anyone is interested in working on this, send me a tweet @znmeb.

 

I’ve just pushed release 1.0.0 of the Data Journalism Developer Studio into the SUSE Gallery. Changes:

  • The base appliance ships with Mozilla Firefox as the browser rather than Chromium. Chromium is available as an add-on installation script set. This was a difficult decision for me to make, but the version of Chromium in the Open Build Service is 13.0.xxx, which is updated frequently and can be unstable. This is roughly equivalent to Google’s “Canary” build on Windows and Macintosh. Chromium was proving too unstable for regular use, so I replaced it with Firefox.
  • I added CoffeeScript to the install scripts for node.js and NowJS. If you’re a JavaScript developer, I welcome more suggestions for node.js packages.

I’m planning to open the project up to other developers in the near future. Now that the Fundry feature request mechanism is in place, the road map is public. My own plan is to start building user-level documentation. Most of the software in the appliance is well-documented on its own, but there aren’t too many examples of application-level usage that I’ve been able to find.


 

I’m really conflicted about this. On the one hand, I know Twitter needs to sell advertising, and web services need to promote themselves. And yes, this is a real news event, not a manufactured story. But I wonder – are we heading back to the days of “Yellow Journalism” in the tweet stream? Please comment below.

 

According to Mashable, “Kraft Looks to Reward Twitter Users Who Tweet About Mac & Cheese”,

Under a new program quietly rolled out over the past few weeks, any time two people individually use the phrase “mac & cheese” in a tweet, they’ll each get a link pointing out the “Mac & Jinx.” The first one to click the link and give Kraft his or her address gets five free boxes of Kraft’s mac and cheese and a T-shirt.

It seems that “Mac & Cheese” is now a Trending Topic, as of 2011-03-08 19:12 UTC. But when you click on the topic, you see this Promoted Tweet:

What could be worse? Alyssa Milano, who has 1,403,372 followers, posted this tweet:

This could get interesting. 

Update: it has gotten interesting. @WootLive has gotten into the act.

Update: FriendsEAT has tweeted about capacity issues stemming from their article.

Oh, by the way: the Kraft campaign that started this whole thing is being run by Crispin Porter + Bogusky. Does that name sound familiar? It’s the same agency that came up with the Groupon Super Bowl ads about Tibet and seafood curry.

 

Update 2011-03-20

For a variety of reasons, I have replaced the Social Media Analytics Research Toolkit, Code Like A Pirate and Project Kipling with a new, modular appliance called the Data Journalism Developer Studio. All of the software found in those three appliances can be installed via scripts provided in the new appliance. Links:


Upon careful reading of Twitter’s API Terms of Service, I have decided to temporarily remove two appliances from the SUSE Studio Gallery. Those two appliances are the Social Media Analytics Research Toolkit (SMART@znmeb) and Project Kipling Real-Time Data Journalism Tools. I do intend to put them back on line at some point in the future, but I do not at this time know when they will be back, because I haven’t determined the scope of required changes to the appliances or their marketing materials. Why? These two appliances may be in violation of item 4.A. below:

4. You will not attempt or encourage others to:

A. sell, rent, lease, sublicense, redistribute, or syndicate the Twitter API or Twitter Content to any third party for such party to develop additional products or services without prior written approval from Twitter;

B. remove or alter any proprietary notices or marks on the Twitter API or Twitter Content;

C. use or access the Twitter API for purposes of monitoring the availability, performance, or functionality of any of Twitter’s products and services or for any other benchmarking or competitive purposes; or

D. use Twitter Marks as part of the name of your company or Service, or in any product, service, or logos created by you. You may not use Twitter Marks in a manner that creates a sense of endorsement, sponsorship, or false association with Twitter. All use of Twitter Marks, and all goodwill arising out of such use, will inure to Twitter’s benefit.

E. use or access the Twitter API to aggregate, cache (except as part of a Tweet), or store place and other geographic location information contained in Twitter Content.

While I don’t encourage people to redistribute Twitter data, the appliances do have the ability to collect Twitter data, and I can’t prevent users from redistributing it. I want to emphasize that Twitter has not asked me to take these appliances down! I don’t know that they violate the letter of item 4.A., but I think they violate the spirit of that clause, so I am removing them until I can determine in what form they are viable products.

 

Data Journalism Developer Studio 2012LX Blog


Disclosure

As you probably know, I live in the Portland, Oregon area and have for many years. One of the must-visit places here is Powell’s Books. The book links in this post will all take you to Powell’s as part of their Partner Program. If you’d like to join the program too, here’s the link.

Updated September 11, 2011: I recently purchased a Kindle, and two of these books are now available in that format. For those two, I’ve tweeted my Amazon Affiliate links out and have embedded those tweets here.

Prerequisite Software

To get the most out of these books, you will need to install some software. You will need Mondrian, R, GGobi, and the ggplot2, rggobi and DescribeDisplay R packages. All of these will run on a Windows, Macintosh or Linux desktop / laptop, including most netbooks. And they are all free, open source software. An easy way to get them all, packaged in an openSUSE Linux appliance, is to download Data Journalism Developer Studio 2012.

ggplot2: Elegant Graphics for Data Analysis (Use R)

ggplot2 is an advanced graphics package for the R programming language. It is based on the grammar of graphics (Grammar of Graphics 2nd Edition). ggplot2 generates the most beautiful static graphics I’ve ever seen. You can use ggplot2 at any stage of your analysis. Exploratory plots take a single call to the “qplot” function, and when you’re ready to create a final report or presentation, you can get publication-quality graphics.

The two things I like most about the ggplot2 package are

  • The absolutely stunning visual appeal of the plots it produces: Dr. Wickham has paid great attention to the visual aspects of the output. I don’t know of another package in any language that generates such beautiful plots.
  • The numerous built-in analysis methods: Boxplots, kernel and quantile regression and smoothing, faceted plots – all are “standard equipment” with ggplot2.
ggplot2: Elegant Graphics for Data Analysis (Use R) by Hadley Wickham http://t.co/GfXsOJm via @
@znmeb
M. Edward Borasky

Interactive Graphics for Data Analysis: Principles and Examples

This book is a complete course in interactive graphics for data analysis. It is mostly based on the Mondrian interactive statistical data visualization system, although there is some use of R as well. The first part covers the basic tools, and the second part gives case studies.

The case studies really are the best part of the book. They cover geographical analysis, some interesting history from the sinking of the Titanic and the 2004 Florida election. As I note below, there is some overlap in tools between Mondrian and GGobi, but you really need both books and both packages to be able to do everything.

Interactive Graphics for Data Analysis: Principles and Examples (Chapman & Hall/CRC C... by Martin Theus http://t.co/6XbdVWU via @
@znmeb
M. Edward Borasky

Interactive and Dynamic Graphics for Data Analysis: With R and GGobi (Use R)

As the title implies, this book is also a complete course in data analysis using interactive graphics. But the focus here is on R and GGobi rather than Mondrian. While there is some overlap in the tools, there are some things Mondrian does that GGobi doesn’t do, and vice versa. A partial list:

  • Geographic datasets: Mondrian only
  • Mosaic plots: Mondrian only
  • Classification: GGobi only
  • Clustering: GGobi only
  • Social network graphs: GGobi only

In addition, GGobi integrates directly with R and ggplot2 via the rggobi and DescribeDisplay packages. There are some integration points between R and Mondrian, but that integration isn’t as tight as it is with R, GGobi and ggplot2.

Interactive and Dynamic Graphics for Data Analysis: With R and GGobi (Use R) by Dianne Cook http://t.co/TxTU5ov via @
@znmeb
M. Edward Borasky
 



Topic Detection and Tracking by James Allan is now available for Kindle. Here’s a tweet with my affiliate link so you can purchase it.

Topic Detection and Tracking by James Allan http://t.co/vvQbtwf7
@znmeb
M. Edward Borasky

Neal Rauhauser (@StrandedWind) pointed out that the links to Twitter documentation were broken. They should be fixed now. Thanks, Neal!


In all the excitement over the Google Nexus One announcement, another announcement came yesterday that hasn’t received a lot of attention. Twitter announced that the streaming API had graduated to production status. So what is this streaming API?

There are actually three Twitter Application Programming Interfaces (APIs). The original API, sometimes called the REpresentational State Transfer (REST) API, covers the basic Twitter functions. From the REST API you can read your timeline and direct messages (DMs), tweet and retweet, follow and unfollow other users, send DMs, manipulate your lists, and so on.

The second is the search API. The search API duplicates what the Twitter search page does. It can do everything that the Twitter Advanced Search can do, and nearly all the Twitter monitoring tools depend on this Twitter search API.

The third API is the streaming API, the one that was released into production yesterday. It has been in alpha test since April of 2009. Before describing the “new” API and its significance, let’s look at the basic tweet flow for declarative statements.

Tweet Flow Schematic Diagram

The basic flow of tweets is as follows:

  1. A user publishes a tweet from a mobile, laptop or desktop device, using a Twitter client. The tweet is tagged with a time stamp, the name of the user that sent it, the user it was sent to if it was a reply, and, if the user has enabled geotagging, the user’s location. In other words, the tweet is a statement: “I am here now, and this is what’s happening.” The tweet is either sent to a specific user (an “@reply”) or broadcast to the world.
  2. The tweet enters the Twitter infrastructure. First, it goes into the main database. Once it is in the main database, it is also sent into a user quality filter, which removes tweets from users that Twitter deems to be of low quality. If the tweet passes the user quality filter, it is forwarded into a second filter, the relevance and ranking filter. If it passes this filter as well, the tweet is sent to the search database, to be indexed for Twitter search. See http://dev.twitter.com/pages/streaming_api_concepts#result-quality and http://help.twitter.com/forums/10713/entries/42646 for more details about the quality filters.

The top arrow on the right represents what you see when you use the REST API. When you access the REST API, you are running a query against the main database. That is, you are creating, reading, updating or deleting tweets in the main database. Like tweets that come from a Twitter client, tweets created in the main database via the REST API are sent to the quality filters.

The second arrow on the right represents the search API. When you access the search API, you are looking up tweets from the past in the search database. This database is read-only; you can’t create, update or delete tweets from it with the search API.

The bottom two arrows represent the streaming API. Like the search API, the streaming API is read-only. Unlike the search API, the streaming API is not buffered through the search database. The raw tweets pass through the same user quality filter as used to qualify tweets for the search database, but they do not go through the relevance and ranking filter. Thus, there are more tweets available to users of the streaming API than there are to users of the search API.
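To make the routing concrete, here is a minimal sketch of the flow as I've described it. It is a toy model, not Twitter's implementation: the class name, the field names and the two filter predicates are all hypothetical stand-ins, since Twitter does not publish how its quality filters work.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Tweet:
    user: str
    text: str


@dataclass
class TweetFlow:
    # Hypothetical predicates standing in for Twitter's unpublished filters.
    user_quality_ok: Callable[[Tweet], bool]
    relevance_ok: Callable[[Tweet], bool]
    main_db: List[Tweet] = field(default_factory=list)    # queried by the REST API
    search_db: List[Tweet] = field(default_factory=list)  # queried by the search API
    stream: List[Tweet] = field(default_factory=list)     # feeds the streaming API

    def publish(self, tweet: Tweet) -> None:
        # Every tweet lands in the main database first.
        self.main_db.append(tweet)
        # Tweets from users that fail the quality filter go no further.
        if not self.user_quality_ok(tweet):
            return
        # The streaming API sees everything that passes the user quality filter.
        self.stream.append(tweet)
        # Only tweets that also pass relevance and ranking get indexed for search.
        if self.relevance_ok(tweet):
            self.search_db.append(tweet)
```

With any pair of predicates, the search database ends up as a subset of the stream, and the stream as a subset of the main database, which is the point of the diagram.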

After passing the user quality filter, the tweets split into two streams, the sample stream and the filter stream. The sample stream is a subset of the raw public timeline; the full timeline is popularly known as the Firehose. The sampling process serves to limit the processing and bandwidth capacity Twitter must provide to publish the stream, and also the processing and bandwidth capacity a subscriber must provide to accept the stream. A subscriber simply reads the sample stream directly from Twitter.

The filter stream is also a subset of the Firehose. Like the sample stream, a subscriber simply reads the stream directly from Twitter. However, as the name implies, the subscriber can filter the tweets coming out of the stream. The stream can be filtered by keywords, lists of users that created the tweets, or locations of geotagged tweets.
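The three kinds of filter can be pictured as a single predicate applied to each tweet. The sketch below runs locally and does not call Twitter at all. The function name, the field names and the bounding-box layout are my own illustrative choices, and I am assuming that a tweet qualifies if it matches any one of the criteria and that keyword matching is a simple case-insensitive substring test.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# (west, south, east, north) in degrees; a hypothetical layout for this sketch.
BoundingBox = Tuple[float, float, float, float]


@dataclass
class Tweet:
    user_id: int
    text: str
    # (longitude, latitude) if the user enabled geotagging, else None.
    coordinates: Optional[Tuple[float, float]] = None


def matches_filter(
    tweet: Tweet,
    keywords: Sequence[str] = (),
    user_ids: Sequence[int] = (),
    boxes: Sequence[BoundingBox] = (),
) -> bool:
    """Return True if the tweet matches any keyword, user or location."""
    text = tweet.text.lower()
    if any(keyword.lower() in text for keyword in keywords):
        return True
    if tweet.user_id in set(user_ids):
        return True
    if tweet.coordinates is not None:
        lon, lat = tweet.coordinates
        for west, south, east, north in boxes:
            if west <= lon <= east and south <= lat <= north:
                return True
    return False
```

A tweet with no geotag can never match a location filter here, which is why geotagging matters so much for locating an event.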

Why is this a big deal? There are a number of reasons, but the one that excites me the most is automated real-time journalism, using a technique called Topic Detection and Tracking (TDT). How does this work? Here’s a simple scenario:

1. The topic detection and tracking server connects to the sample stream and monitors tweets as they arrive. The sample stream is a representative sample in near real time of all the tweets being published from all over the world. After a certain period of time, the monitoring process will have established a baseline of common words and hashtags that it sees, and when and where “normal” tweets are published on average.

2. Now suppose an event occurs: for example, an ambulance arrives at the home of a celebrity. A crowd gathers and starts tweeting about it. Because of the sampling, not all of these tweets will show up in the sample stream. But if the event is newsworthy enough, word will spread, the initial reports will get retweeted, and the monitoring process will notice new words, such as the name of the celebrity and perhaps some newly created hashtags. And enough of the tweets will be geotagged to get a fix on the location of the event, even if nobody mentions the location in a tweet.

3. When the monitoring process sees an event, it can then initiate a backsearch, using the search API, to collect more details about the event. The backsearch should retrieve all the tweets about the event except those that have been filtered out by Twitter’s search quality filter. This can pinpoint fairly closely the time at which the event first entered the tweet stream, and delivers a list of users tweeting about the event.

4. A filter can now be constructed and a filter stream set up with keywords, location and users. In addition, using the REST API, all of the tweets from these users can be retrieved, in case some were missed by the other processes. Tweets of users blocked by the quality filter can be retrieved via the REST API if their names were discovered in a previous step.
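The baseline-and-detection idea in steps 1 and 2 can be sketched with nothing more than word counts. This is a deliberately crude illustration and not a real Topic Detection and Tracking implementation: the function names, the thresholds and the add-one smoothing are all assumptions of mine. It compares how often each term appears in a recent window of tweets against how often it appeared in the baseline, and flags terms whose rate has jumped.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Words, with an optional leading # or @ so hashtags and mentions survive.
TOKEN = re.compile(r"[#@]?\w+")


def tokenize(text: str) -> List[str]:
    return [token.lower() for token in TOKEN.findall(text)]


def term_counts(tweets: Iterable[str]) -> Tuple[Counter, int]:
    """Count, for each term, how many tweets mention it at least once."""
    counts: Counter = Counter()
    total = 0
    for text in tweets:
        total += 1
        counts.update(set(tokenize(text)))
    return counts, total


def bursting_terms(
    baseline: Iterable[str],
    window: Iterable[str],
    min_count: int = 3,    # ignore terms seen in fewer tweets than this
    ratio: float = 5.0,    # how much more frequent than baseline counts as a burst
) -> List[str]:
    base_counts, base_total = term_counts(baseline)
    win_counts, win_total = term_counts(window)
    if win_total == 0:
        return []
    flagged = []
    for term, count in win_counts.items():
        if count < min_count:
            continue
        win_rate = count / win_total
        # Add-one smoothing so terms never seen in the baseline do not divide by zero.
        base_rate = (base_counts[term] + 1) / (base_total + 1)
        if win_rate / base_rate >= ratio:
            flagged.append(term)
    return sorted(flagged)
```

The flagged terms are the natural inputs to steps 3 and 4: they become the query for the backsearch and the keywords for the filter stream.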

Using these processes, a complete “bird’s eye view” of the event can be constructed. The process can create Twitter lists of the users using the REST API, send alerts, publish an RSS feed about the event, and even do some more sophisticated natural language processing.

For more information about the streaming API, see http://dev.twitter.com/pages/streaming_api. A detailed description of the mathematics of Topic Detection and Tracking can be found at Semantic Classes in Topic Detection and Tracking.

© 2011 Borasky Research Journal Suffusion theme by Sayontan Sinha