M. Edward Borasky

 

I’ve just pushed release 1.0.0 of the Data Journalism Developer Studio into the SUSE Gallery. Changes:

  • The base appliance now ships with Mozilla Firefox as the browser rather than Chromium. Chromium remains available through a set of add-on installation scripts. This was a difficult decision for me to make, but the version of Chromium in the Open Build Service is 13.0.xxx, a frequently updated build that is roughly equivalent to Google’s “Canary” build on Windows and Macintosh. It was proving too unstable for regular use, so I replaced it with Firefox.
  • I added CoffeeScript to the install scripts for node.js and NowJS. If you’re a JavaScript developer, I welcome more suggestions for node.js packages.

I’m planning to open the project up to other developers in the near future. Now that the Fundry feature request mechanism is in place, the road map is public. My own plan is to start building user-level documentation. Most of the software in the appliance is well-documented on its own, but I haven’t been able to find many examples of application-level usage.


 

I’m really conflicted about this. On the one hand, I know Twitter needs to sell advertising, and web services need to promote themselves. And yes, this is a real news event, not a manufactured story. But I wonder – are we heading back to the days of “Yellow Journalism” in the tweet stream? Please comment below.

 

According to Mashable, “Kraft Looks to Reward Twitter Users Who Tweet About Mac & Cheese”:

Under a new program quietly rolled out over the past few weeks, any time two people individually use the phrase “mac & cheese” in a tweet, they’ll each get a link pointing out the “Mac & Jinx.” The first one to click the link and give Kraft his or her address gets five free boxes of Kraft’s mac and cheese and a T-shirt.

It seems that “Mac & Cheese” is now a Trending Topic, as of 2011-03-08 19:12 UTC. But when you click on the topic, you see this Promoted Tweet:

What could be worse? Alyssa Milano, who has 1,403,372 followers, posted this tweet:

This could get interesting. 

Update: it has gotten interesting. @WootLive has gotten into the act.

Update: FriendsEAT has tweeted about capacity issues stemming from their article.

Oh, by the way, the Kraft campaign that started this whole thing is being run by Crispin Porter + Bogusky. Does that name sound familiar? It’s the same agency that came up with the Groupon Super Bowl ads about Tibet and seafood curry.

 

Update 2011-03-20

For a variety of reasons, I have replaced the Social Media Analytics Research Toolkit, Code Like A Pirate and Project Kipling with a new, modular appliance called the Data Journalism Developer Studio. All of the software found in those three appliances can be installed via scripts provided in the new appliance. Links:


Upon careful reading of Twitter’s API Terms of Service, I have decided to temporarily remove two appliances from the SUSE Studio Gallery. Those two appliances are the Social Media Analytics Research Toolkit (SMART@znmeb) and Project Kipling Real-Time Data Journalism Tools. I do intend to put them back on line at some point in the future, but I do not at this time know when they will be back, because I haven’t determined the scope of required changes to the appliances or their marketing materials. Why? These two appliances may be in violation of item 4.A. below:

4. You will not attempt or encourage others to:

A. sell, rent, lease, sublicense, redistribute, or syndicate the Twitter API or Twitter Content to any third party for such party to develop additional products or services without prior written approval from Twitter;

B. remove or alter any proprietary notices or marks on the Twitter API or Twitter Content;

C. use or access the Twitter API for purposes of monitoring the availability, performance, or functionality of any of Twitter’s products and services or for any other benchmarking or competitive purposes; or

D. use Twitter Marks as part of the name of your company or Service, or in any product, service, or logos created by you. You may not use Twitter Marks in a manner that creates a sense of endorsement, sponsorship, or false association with Twitter. All use of Twitter Marks, and all goodwill arising out of such use, will inure to Twitter’s benefit.

E. use or access the Twitter API to aggregate, cache (except as part of a Tweet), or store place and other geographic location information contained in Twitter Content.

While I don’t encourage people to redistribute Twitter data, the appliances do have the ability to collect Twitter data, and I can’t prevent users from redistributing it. I want to emphasize that Twitter has not asked me to take these appliances down! I don’t know that they violate the letter of item 4.A., but I think they violate the spirit of that clause, so I am removing them until I can determine in what form they are viable products.

 

Data Journalism Developer Studio 2012LX Blog




I’ve just released version 2.0.0 of the Social Media Analytics Research Toolkit. In addition to the usual updating of packages to the most recent versions, SMART@znmeb now has a sentiment analysis library! The library is an R package that I discovered on Twitter just today called textir. Textir is a “set of tools for inference about text and associated speaker/document sentiment,” created by Assistant Professor of Econometrics and Statistics and Robert L. Graves Faculty Fellow Matt Taddy of the University of Chicago Booth School of Business.

If you’re interested in the mathematics behind this package, Professor Taddy has posted a document to arXiv.org, titled “Inverse Regression for Analysis of Sentiment in Text.” The paper describes three sample problems and their solutions: ideology in political speeches, on-line restaurant reviews, and business news and stock performance. I’m excited to have this package available.

The political speech and restaurant review datasets are included with the library, but I couldn’t find the business news data set. I’m also adding the package to Project Kipling, but it will be a day or so before I get that build completed. I’ve been wanting a sentiment analysis capability in the appliances for quite some time, but haven’t been able to find an open source package until now.

One final note: 2.0.0 will be the last release of the larger appliance from 2010, Code Like A Pirate. Everything in that appliance and more can be found in Project Kipling, and there have been so few downloads of it that there’s no point in duplicating the effort and taking up the extra disk space. I’m going to leave the Open Virtualization Format (OVF) file up on SUSE Studio for Code Like A Pirate 2.0.0, but all the other builds will be removed and no more will be done.

 

Update 2011-03-20

For a variety of reasons, I have replaced the Social Media Analytics Research Toolkit, Code Like A Pirate and Project Kipling with a new, modular appliance called the Data Journalism Developer Studio. All of the software found in those three appliances can be installed via scripts provided in the new appliance. Links:

 

Data Journalism Developer Studio 2012 Overview

Download Data Journalism Developer Studio 2012 From SUSE Gallery

Data Journalism Developer Studio on Github

Data Journalism Developer Studio 2012 Blog


The rise of Twitter has made large quantities of text available in multiple languages. As a result, text processing, text analytics and other natural language processing techniques have become a staple in business intelligence. So I’ve put together a list of what I think are the essential references in the area. I’ve attempted to arrange them in order of increasing mathematical sophistication. And, as always, I’ve provided Powell’s Partner Program links so you can buy them.

Most of the algorithms described in these books are available in the Data Journalism Developer Studio. For a complete description of the toolkit, see About The Data Journalism Developer Studio.


Even if you’re not a Python programmer, this book is probably the best place to start. The book will walk you through the Python language, and it’s written by the experts on the Python Natural Language Toolkit, which is one of the featured components of my Social Media Analytics Research Toolkit.


For Perl programmers, this book is a good place to start. Topics include pattern matching, data structures, probability, information retrieval, corpus linguistics, multivariate statistics, clustering and an introduction to R programming. Both Perl and R are available in the Social Media Analytics Research Toolkit.


After you’ve gotten started, this book will give you a good overview of the more technical and mathematical aspects of natural language processing. Topics include classical approaches, empirical and statistical approaches and applications. Machine translation, speech recognition, information retrieval, question answering, ontology construction and sentiment analysis are all covered.

The chapter on sentiment analysis is particularly well done. It covers most current techniques and includes sections on dealing with spam. Sentiment analysis is still somewhat controversial, although nearly all social media monitoring providers include it in some form. This chapter provides much-needed clarity on just what is and isn’t possible in sentiment analysis. Chapter 13, “Normalized Web Distance and Word Similarity”, is also notable. It describes the algorithms used in the CompLearn suite of programs that are part of the Social Media Analytics Research Toolkit.
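To give a feel for the kind of similarity measure involved, here is a minimal Python sketch of the normalized compression distance (NCD), which, as I understand it, is the measure the CompLearn tools compute and a close relative of the normalized web distance covered in that chapter. Using zlib as the compressor, and the sample byte strings themselves, are assumptions made for illustration; this is not CompLearn’s own code.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (zlib is an assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small for similar inputs, close to 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample data: two near-identical texts and one unrelated byte pattern.
tweet_a = b"free mac and cheese for everyone who tweets about mac and cheese " * 10
tweet_b = b"free mac and cheese for anyone who tweets about mac and cheese " * 10
other = bytes(range(256)) * 4

print(ncd(tweet_a, tweet_b))  # the similar pair
print(ncd(tweet_a, other))    # the unrelated pair
```

The similar pair should score noticeably lower than the unrelated pair. A real application would compare whole documents and would probably use a stronger compressor.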


This isn’t strictly a book about either text processing or natural language processing, but I’ve included it for three reasons:

  1. It covers all of the matrix decompositions one would use in text processing and natural language processing.
  2. It covers algorithms for social graph analysis.
  3. It has a very readable introduction to using tensors – arrays with more than two dimensions – in data mining. My opinion is that tensor-based algorithms are the future of natural language processing in general and text analytics in particular.

While also not totally about text / natural language processing, this book is an excellent overview of the technologies used in counterterrorism. There’s not as much technical detail as there is in the other books; you’ll need to follow the references. I’ve included this book because I see great potential for some of the technologies in business intelligence. For example, mining data for people in a social media site who “should” be friends or followers but aren’t is one technique businesses could “borrow” from law enforcement.


This book is an excellent overview of some of the more recent research in text mining. It includes chapters on “Detection of Bias in Media Outlets with Statistical Learning Methods”, “Topic Models”, “Utility-Based Information Distillation” and “Adaptive Information Filtering”. But in my opinion one chapter, “Nonnegative Matrix and Tensor Factorization for Discussion Tracking”, justifies the purchase of the book on its own.

Much of modern text processing depends on linear algebra over so-called “bag of words” vector space models. In such a model, keywords are extracted from the text and a collection of documents (called a corpus) is represented by arrays of keyword frequencies. In these models, a matrix is a two-dimensional array of the frequencies, usually indexed by keywords for rows and by documents or document authors for columns.
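The keyword-by-document layout described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `tokenize` and `term_document_matrix` helpers and the two-document corpus are hypothetical, and a real pipeline would also drop stop words and apply some form of frequency weighting.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_document_matrix(corpus):
    """Return (keywords, matrix), where matrix[i][j] counts keywords[i] in document j."""
    counts = [Counter(tokenize(doc)) for doc in corpus]
    keywords = sorted(set().union(*counts))
    matrix = [[c[word] for c in counts] for word in keywords]
    return keywords, matrix


# Hypothetical two-document corpus.
corpus = [
    "mac and cheese is trending",
    "cheese and more cheese",
]
keywords, matrix = term_document_matrix(corpus)
for word, row in zip(keywords, matrix):
    print(f"{word:10s} {row}")
```

Each row is a keyword and each column is a document. Adding a third index, such as time, would turn the matrix into the kind of tensor that the discussion-tracking chapter works with.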


Latent Semantic Analysis, sometimes called Latent Semantic Indexing, is a common technique in natural language processing, and this book explores it in both mathematical and practical detail. There is also a chapter on probabilistic topic modeling, sometimes called latent Dirichlet allocation. If you want to experiment with these techniques, I recommend the open source Java-language Mallet package. Mallet is included in the Social Media Analytics Research Toolkit, as are R language tools for latent semantic analysis.


Finally, we come to my current area of research, Topic Detection and Tracking. This book is the classic reference on the subject, and is required reading if you’re interested in automated journalism.


Appendix – R Language Natural Language Processing Task View

In addition to the Python Natural Language Toolkit and Mallet, the Social Media Analytics Research Toolkit contains the R Natural Language Processing Task View. Here’s a copy of the contents of that task view as of 2010-08-23:

CRAN Task View: Natural Language Processing



 

Update 2011-03-03

Upon careful reading of Twitter’s API Terms of Service, I have decided to delete this blog post. Specifically, it may be in violation of item 4.C. below:

4. You will not attempt or encourage others to:

A. sell, rent, lease, sublicense, redistribute, or syndicate the Twitter API or Twitter Content to any third party for such party to develop additional products or services without prior written approval from Twitter;

B. remove or alter any proprietary notices or marks on the Twitter API or Twitter Content;

C. use or access the Twitter API for purposes of monitoring the availability, performance, or functionality of any of Twitter’s products and services or for any other benchmarking or competitive purposes; or

D. use Twitter Marks as part of the name of your company or Service, or in any product, service, or logos created by you. You may not use Twitter Marks in a manner that creates a sense of endorsement, sponsorship, or false association with Twitter. All use of Twitter Marks, and all goodwill arising out of such use, will inure to Twitter’s benefit.

E. use or access the Twitter API to aggregate, cache (except as part of a Tweet), or store place and other geographic location information contained in Twitter Content.

I want to emphasize that Twitter has not asked me to take this blog post down! I don’t know that it violates the letter of item 4.C., but I think it violates the spirit of that clause, so I am removing it.

 

Download the Data Journalism Developer Studio from SUSE Gallery!


Thanks to Drew Conway (@drewconway), a PhD student at New York University, there are now eight excellent video tutorials on using the R language up on VCASMO. I think I should do a blog post about how magical VCASMO is and why they should be eating SlideShare’s lunch, but for now, I’ll just say:

  • The presentations there are synchronized video and slides, and
  • They have an API.

  1. MATLAB/R Dictionary (Rosetta Stone Talk – 1/3)
    Presentation given by Harlan Harris at the NYC R Statistical Meetup on January 7, 2010.
  2. Learning R via Python…or the other way around (Rosetta Stone Talk – 2/3)
    Presentation given by Drew Conway at the NYC R Statistical Meetup on January 7, 2010.
  3. Data munging with SQL and R (Rosetta Stone Talk – 3/3)
    Presentation given by Josh Reich at the NYC R Statistical Meetup on January 7, 2010.
  4. Use Rapache: It Works!
    Presentation given at the Bay Area useR Group on January 10, 2010 by Jeff Horner, creator of the Rapache module, on “R-driven web applications”.
  5. Web Development with R
    Presentation given at the Bay Area useR Group on January 10, 2010 by Jeroen Ooms, on how to create web applications using R.
  6. Social Network Analysis in R
    Presentation by Drew Conway on August 6, 2009 at the NYC R Statistical Programming Meetup on how to perform basic social network analysis in R using the igraph package.
  7. Introduction to the Grammar of Graphics with ggplot2 in R
    A detailed introduction to the Grammar of Graphics as implemented in R with the data visualization library ggplot2. This talk was given by Harlan Harris to the NYC R Statistical Meetup on December 3, 2009.
  8. Visualizing Data in R with ggplot2
    Drew Conway presents a brief talk on how to visualize data in R with ggplot2 at the NYC R Statistical Meetup on December 3, 2009.

In case you missed them, here is the Powell’s link for the ggplot2 book:

© 2011 Borasky Research Journal