
Data Journalism Developer Studio 2012LX

 

Data Journalism Developer Studio 2012LX Blog


By now, you’ve probably seen the reactions to Apple’s “education event” yesterday. My take is that it was 100% Apple marketing and zero “disrupting education.” It was all about selling overpriced tablets to schools that are struggling to keep teachers on the payroll. It was all about forcing authors to buy new Macintosh machines or upgrade existing ones to Mac OS X “Lion”. And it was about a restrictive EULA for authors.

Textbooks should be free! That’s one way to disrupt education. And CK12.org provides Science, Technology, Engineering and Mathematics (STEM) textbooks for free. These are textbooks developed by educators, not marketers. They work on iPads, Kindles and PDF readers, or you can read them online in your browser. There are authoring tools on the web site as well. The current CK-12 FlexBooks Library lists 38 mathematics textbooks, 34 in science and 20 in other subjects. Some have both student and teacher editions. Once you have an account, you can access the authoring and reformatting tools. I highly recommend creating one even if you only want to read or teach from the books.

And education software should be free! The most comprehensive collection of free educational software I’ve found is openSUSE Linux for Education – openSUSE:Education-Li-f-e. This is a Live DVD that will boot on most PC-based hardware with at least 1 GB of RAM. You don’t even need a hard drive – since it’s a Live DVD, Li-f-e doesn’t touch the hard drive unless you explicitly direct it to do so. If you want, you can copy the DVD to a USB drive and boot from that. The directions for that are here.

Li-f-e is an absolutely stunning collection of software. It has the openSUSE 12.1 32-bit Linux operating system, the GNOME 3, KDE 4 and ultra-light IceWM desktops, desktop / productivity software, and a comprehensive collection of educational software for students ranging from pre-school all the way up into graduate school. It also has a complete Linux / Apache / MySQL / PHP (LAMP) server stack, a Linux Terminal Server Project (LTSP) server stack and a complete suite of professional software and web development tools. And the Scratch tools for teaching kids to program are there.

Given that these free tools exist, and have been around since well before the iPad, I don’t see how Apple marketing can claim to be disrupting education. There’s real disruption if you know where to look.

 



I was on the webinar that introduced this book, along with thousands of others. Over 30,000 registered, and the count of people who attended was 10,899. If you had a Kindle, you could download this book for free. I did. In fact, if you’re an Amazon Prime member, you can still get the Kindle edition for free. Dan Zarrella has been collecting Twitter and Facebook data for some time now, and this book is the result of a careful study of what works, what doesn’t work, and what sometimes works. It’s an easy read and very useful.


Pulse is along the lines of the previous book, but is much more detailed, and talks about more data sources. Moreover, it’s research-oriented – how to do the sort of thing Dan Zarrella did to define his hierarchy of contagiousness.

Pulse covers Google search trends, Facebook connections, blogs and Twitter in some detail. The themes are “what we surf, who we friend and what we say.” Chapter 8 describes three “potential pulses” – “where we go, what we buy and how we play”. You should think of this book as an overview – you’ll need to dig deeper if you want to implement any of the research described.


One of the recurring themes in the past two years has been the so-called “lean startup”. I have some skepticism about the concept, particularly the way it’s been described in blogs. So it’s refreshing to see a book like Venture Deals come out that’s full of actual meat, not just admonitions to “fail faster.” The authors were here in Portland a few weeks ago for a well-attended lecture on the contents, and I have a signed copy. If you’re starting a business, you need to read this book first.


I’ve covered this book at some length here, so I’ll refer you to the previous review. While it’s very advanced technically, it’s the best book published this year on trading technology. Most of the year’s other trading books are rehashes of decades-old technical analysis methods that may or may not work any more. If you want to be a trader, I highly recommend getting this book.


We heard a lot about economic inequality this year, and we’ll hear a lot more in 2012. Whether it’s Occupy Wall Street, calls for higher taxes on the wealthy by Warren Buffett, or President Obama’s speech in Osawatomie, Kansas, economic inequality and how to reverse it have become a topic of interest.

Changing Inequality is an easy-to-read book on the subject. It traces the causes of economic inequality over the past three decades, and suggests a few possible ways to reverse the trend. It should be noted, however, that reversing a 30-year trend of rising inequality isn’t easy. For example, as noted in the book,

In the late 1980s, when it first became clear that rapid increases in inequality were more than a short-term or cyclical phenomenon, researchers began to look for causes. It was almost a decade before widespread consensus was reached among economists that these changes were largely driven by skill-biased increases in demand, many of them probably the result of technological changes linked to a growing use of computer technologies.

The challenge for policy-makers is to devise policies that promote both growth and equality. Changing Inequality is a good place to start.

 



The world of computational finance has changed dramatically since I first got interested in the underlying mathematics in 1982. We’ve seen events like the stock market crashes in 1987 and 1989, the failure of Long-Term Capital Management in 1998, and more recently, the collapse of Lehman Brothers in September 2008 and the “Flash Crash” in May of 2010.

I’ve spent a fair amount of time over the past year catching up on the theory and practice of algorithmic trading. The following three books are the best I’ve found on the subject. Having made my way through them, I consider traditional technical analysis at best useless and at worst downright suicidal. They are expensive; if you can only afford one of them, I’d recommend the second, Asset Price Dynamics, Volatility, and Prediction by Stephen J. Taylor.


 

Financial Markets and Trading is the newest of these books, and is also the most expensive. It’s designed as a textbook at the undergraduate / graduate level and is fairly self-contained. Schmidt does cover a lot of ground, however, and for implementation details you’ll probably need to search out the original papers on the Internet.

What makes this book unique is

  • An extended section on high-frequency trading, including an overview of the May 2010 “Flash Crash”, and
  • A comprehensive chapter on testing technical trading rules.

These testing techniques go well beyond the traditional backtesting / optimization techniques that are well-known among traders. As this book and its references show, technical analysis sometimes works and sometimes it doesn’t. You’ll need these algorithms to know the difference.


 

 

As I noted above, if you can only afford one of these books, this is the one to get. Unique features include

  • Spreadsheet formulas for many of the algorithms,
  • Algorithms for extracting information from high-frequency data, and
  • Implied return density calculations from options prices.

There are also some algorithms for testing technical trading rules, but I think Schmidt’s treatment of the subject is far more comprehensive.


 

This is the oldest book of the three, and probably the most theoretical. However, it provides much more detail on market microstructure models than the other two, and it includes a chapter on order execution timing strategies.


 



Disclosure

As you probably know, I live in the Portland, Oregon area and have for many years. One of the must-visit places here is Powell’s Books. The book links in this post will all take you to Powell’s as part of their Partner Program. If you’d like to join the program too, here’s the link.

Updated September 11, 2011: I recently purchased a Kindle, and two of these books are now available in that format. For those two, I’ve tweeted my Amazon Affiliate links out and have embedded those tweets here.

Prerequisite Software

To get the most out of these books, you will need to install some software. You will need Mondrian, R, GGobi, and the ggplot2, rggobi and DescribeDisplay R packages. All of these will run on a Windows, Macintosh or Linux desktop / laptop, including most netbooks. And they are all free, open source software. An easy way to get them all, packaged in an openSUSE Linux appliance, is to download Data Journalism Developer Studio 2012.

Ggplot2: Elegant Graphics for Data Analysis (Use R)

ggplot2 is an advanced graphics package for the R programming language. It is based on the grammar of graphics (Grammar of Graphics 2nd Edition). ggplot2 generates the most beautiful static graphics I’ve ever seen. You can use ggplot2 at any stage of your analysis. Exploratory plots can be made with a single call to the “qplot” function, and when you’re ready to create a final report or presentation, you can get publication-quality graphics.

The two things I like most about the ggplot2 package are

  • The absolutely stunning visual appeal of the plots it produces: Dr. Wickham has paid great attention to the visual aspects of the output. I don’t know of another package in any language that generates such beautiful plots.
  • The numerous built-in analysis methods: Boxplots, kernel and quantile regression and smoothing, faceted plots – all are “standard equipment” with ggplot2.

ggplot2: Elegant Graphics for Data Analysis (Use R) by Hadley Wickham http://t.co/GfXsOJm

Interactive Graphics for Data Analysis: Principles and Examples

This book is a complete course in interactive graphics for data analysis. It is mostly based on the Mondrian interactive statistical data visualization system, although there is some use of R as well. The first part covers the basic tools, and the second part gives case studies.

The case studies really are the best part of the book. They cover geographical analysis and some interesting history, including the sinking of the Titanic and the 2004 Florida election. As I note below, there is some overlap in tools between Mondrian and GGobi, but you really need both books and both packages to be able to do everything.

Interactive Graphics for Data Analysis: Principles and Examples (Chapman & Hall/CRC C... by Martin Theus http://t.co/6XbdVWU

Interactive and Dynamic Graphics for Data Analysis: With R and Ggobi (Use R)

As the title implies, this book is also a complete course in data analysis using interactive graphics. But the focus here is on R and GGobi rather than Mondrian. While there is some overlap in the tools, there are some things Mondrian does that GGobi doesn’t do, and vice versa. A partial list:

  • Geographic datasets: Mondrian only
  • Mosaic plots: Mondrian only
  • Classification: GGobi only
  • Clustering: GGobi only
  • Social network graphs: GGobi only

In addition, GGobi integrates directly with R and ggplot2 via the rggobi and DescribeDisplay packages. There are some integration points between R and Mondrian, but that integration isn’t as tight as it is with R, GGobi and ggplot2.

Interactive and Dynamic Graphics for Data Analysis: With R and GGobi (Use R) by Dianne Cook http://t.co/TxTU5ov
 

Data Journalism Developer Studio 2012 Overview

Download Data Journalism Developer Studio 2012 From SUSE Gallery

Data Journalism Developer Studio on Github

Data Journalism Developer Studio 2012 Blog


Disclosure

As you probably know, I live in the Portland, Oregon area and have for many years. One of the must-visit places here is Powell’s Books. The book links in this post will all take you to Powell’s as part of their Partner Program. If you’d like to join the program too, here’s the link.

Updated September 7, 2011: I recently purchased a Kindle, and quite a few of these books are now available in that format. For those that are, I’ve tweeted my Amazon Affiliate links out and have embedded those tweets here.


Updated May 18, 2010: There is an interesting discussion happening on LinkedIn about this subject, and quite a few links to data mining and predictive analytics resources have been posted there. So here’s the link. Enjoy! @znmeb


I’m a big fan of the R programming language, especially for data visualization. I own and recommend all of the following books.


The Grammar of Graphics (Statistics and Computing) by Leland Wilkinson http://t.co/zEz5210

ggplot2: Elegant Graphics for Data Analysis (Use R) by Hadley Wickham http://t.co/GfXsOJm

Interactive and Dynamic Graphics for Data Analysis: With R and GGobi (Use R) by Dianne Cook http://t.co/TxTU5ov

Graphics of Large Datasets: Visualizing a Million by Antony Unwin http://t.co/8PKQMNI

The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Seco... by Jerome Friedman http://t.co/UjlmX1v

Modern Applied Statistics with S (Statistics and Computing) by W.N. Venables http://t.co/fxEeMPb

S Programming (Statistics and Computing) by William Venables http://t.co/ptFdyms

Software for Data Analysis: Programming with R (Statistics and Computing) by John Chambers http://t.co/N3ka81j

Quantile Regression (Econometric Society Monographs) by Roger Koenker http://t.co/a22nCrG



Finding Groups in Data by Leonard Kaufman (Powells.com)

Programming with Data by John M. Chambers (Powells.com)


 



Topic Detection and Tracking by James Allan is now available for Kindle. Here’s a tweet with my affiliate link so you can purchase it.

Topic Detection and Tracking by James Allan http://t.co/vvQbtwf7

Neal Rauhauser (@StrandedWind) pointed out that the links to Twitter documentation were broken. They should be fixed now. Thanks, Neal!


In all the excitement over the Google Nexus One announcement, another announcement came yesterday that hasn’t received a lot of attention. Twitter announced that the streaming API had graduated to production status. So what is this streaming API?

There are actually three Twitter Application Programming Interfaces (APIs). The original API, sometimes called the Representational State Transfer (REST) API, covers the basic Twitter functions. From the REST API you can read your timeline and direct messages, tweet and retweet, follow and unfollow other users, send direct messages (DMs), manipulate your lists, and so on.

The second is the search API. The search API duplicates what the Twitter search page does. It can do everything that the Twitter Advanced Search can do, and nearly all the Twitter monitoring tools depend on this Twitter search API.

The third API is the streaming API, the one that was released into production yesterday. It has been in alpha test since April of 2009. Before describing the “new” API and its significance, let’s look at the basic tweet flow for declarative statements.

Tweet Flow Schematic Diagram

The basic flow of tweets is as follows:

  1. A user publishes a tweet from a mobile, laptop or desktop device, using a Twitter client. The tweet is tagged with a time stamp, the name of the user that sent it, the user it was sent to if it was a reply, and, if the user has enabled geotagging, the user’s location. In other words, the tweet is a statement: “I am here now, and this is what’s happening.” The tweet is either sent to a specific user (an “@reply”) or broadcast to the world.
  2. The tweet enters the Twitter infrastructure. First, it goes into the main database. Once it is in the main database, it is also sent into a user quality filter, which removes tweets from users that Twitter deems to be of low quality. If the tweet passes the user quality filter, it is forwarded into a second filter, the relevance and ranking filter. If it passes this filter as well, the tweet is sent to the search database, to be indexed for Twitter search. See http://dev.twitter.com/pages/streaming_api_concepts#result-quality and http://help.twitter.com/forums/10713/entries/42646 for more details about the quality filters.
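
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Tweet` fields mirror the metadata listed in step 1, while `TweetFlow`, `user_quality_ok` and `relevance_ok` are hypothetical names of my own, since Twitter does not publish how its filters actually work.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Tweet:
    user: str                                        # who sent it
    text: str
    timestamp: float                                 # when it was sent
    reply_to: Optional[str] = None                   # set for an "@reply"
    location: Optional[Tuple[float, float]] = None   # (lat, lon) if geotagged


@dataclass
class TweetFlow:
    # Hypothetical predicates standing in for Twitter's undisclosed filters.
    user_quality_ok: Callable[[Tweet], bool]
    relevance_ok: Callable[[Tweet], bool]
    main_db: List[Tweet] = field(default_factory=list)
    search_db: List[Tweet] = field(default_factory=list)

    def publish(self, tweet: Tweet) -> None:
        # Every tweet is stored in the main database first.
        self.main_db.append(tweet)
        # Only tweets that pass both filters, in order, are indexed for search.
        if self.user_quality_ok(tweet) and self.relevance_ok(tweet):
            self.search_db.append(tweet)


# Toy filters, purely for illustration.
flow = TweetFlow(
    user_quality_ok=lambda t: t.user != "spambot",
    relevance_ok=lambda t: len(t.text) > 3,
)
flow.publish(Tweet("alice", "I am here now", 1.0))
flow.publish(Tweet("spambot", "buy followers", 2.0))
flow.publish(Tweet("bob", "hi", 3.0))
```

With these toy filters, all three tweets reach `main_db`, but only the first one ends up in `search_db`.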

The top arrow on the right represents what you see when you use the REST API. When you access the REST API, you are running a query against the main database. That is, you are creating, reading, updating or deleting tweets from the main database. Like tweets that come from a Twitter client, tweets created in the main database via the REST API are sent to the quality filters.

The second arrow on the right represents the search API. When you access the search API, you are looking up tweets from the past in the search database. This database is read-only; you can’t create, update or delete tweets from it with the search API.

The bottom two arrows represent the streaming API. Like the search API, the streaming API is read-only. Unlike the search API, the streaming API is not buffered through the search database. The raw tweets pass through the same user quality filter as used to qualify tweets for the search database, but they do not go through the relevance and ranking filter. Thus, there are more tweets available to users of the streaming API than there are to users of the search API.

After the user quality filter, the tweets split into two streams: the sample stream and the filter stream. The sample stream is a subset of the raw public timeline, popularly known as the Firehose. The sampling process serves to limit the processing and bandwidth capacity Twitter must provide to publish the stream, and also the processing and bandwidth capacity a subscriber must provide to accept the stream. A subscriber simply reads the sample stream directly from Twitter.

The filter stream is also a subset of the Firehose. Like the sample stream, a subscriber simply reads the stream directly from Twitter. However, as the name implies, the subscriber can filter the tweets coming out of the stream. The stream can be filtered by keywords, lists of users that created the tweets, or locations of geotagged tweets.
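
To make the difference between the two streams concrete, here is a minimal local sketch in Python. It does not talk to Twitter at all. The tweet dictionaries, the function names, uniform random sampling, and the choice to match a tweet when any one of the criteria hits are all assumptions made for illustration.

```python
import random
from typing import Dict, Iterable, Iterator, Optional, Sequence, Tuple

Tweet = Dict[str, object]                  # hypothetical shape: user, text, location
BBox = Tuple[float, float, float, float]   # (south, west, north, east)


def sample_stream(firehose: Iterable[Tweet], rate: float, seed: int = 0) -> Iterator[Tweet]:
    """Yield a random subset of the firehose (uniform sampling is an assumption)."""
    rng = random.Random(seed)
    for tweet in firehose:
        if rng.random() < rate:
            yield tweet


def in_bbox(location: Tuple[float, float], bbox: BBox) -> bool:
    lat, lon = location
    south, west, north, east = bbox
    return south <= lat <= north and west <= lon <= east


def filter_stream(
    firehose: Iterable[Tweet],
    keywords: Sequence[str] = (),
    users: Sequence[str] = (),
    bbox: Optional[BBox] = None,
) -> Iterator[Tweet]:
    """Yield tweets that match any keyword, any listed user, or the bounding box."""
    wanted_words = {k.lower() for k in keywords}
    wanted_users = set(users)
    for tweet in firehose:
        words = set(str(tweet["text"]).lower().split())
        location = tweet.get("location")
        if (
            words & wanted_words
            or tweet["user"] in wanted_users
            or (bbox is not None and location is not None and in_bbox(location, bbox))
        ):
            yield tweet


firehose = [
    {"user": "alice", "text": "Ambulance outside the theater", "location": None},
    {"user": "bob", "text": "Lunch time", "location": (45.52, -122.68)},
    {"user": "carol", "text": "Nothing to see", "location": None},
]
matched = list(filter_stream(firehose, keywords=["ambulance"], bbox=(45.0, -123.0, 46.0, -122.0)))
```

Here `matched` holds the tweets from alice (keyword match) and bob (inside the bounding box), while carol's tweet is dropped.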

Why is this a big deal? There are a number of reasons, but the one that excites me the most is automated real-time journalism, using a technique called Topic Detection and Tracking (TDT). How does this work? Here’s a simple scenario:

1. The topic detection and tracking server connects to the sample stream and monitors tweets as they arrive. The sample stream is a representative sample in near real time of all the tweets being published from all over the world. After a certain period of time, the monitoring process will have established a baseline of common words and hashtags that it sees, and when and where “normal” tweets are published on average.

2. Now suppose an event occurs: for example, an ambulance arrives at the home of a celebrity. A crowd gathers and starts tweeting about it. Because of the sampling, not all of these tweets will show up in the sample stream. But if the event is newsworthy enough, word will spread, the initial reports will get retweeted, and the monitoring process will notice new words, such as the name of the celebrity and perhaps some hashtags that have been created. And enough of the tweets will be geotagged to get a fix on the location of the event, even if nobody tweets the location itself.

3. When the monitoring process sees an event, it can then initiate a backsearch, using the search API, to collect more details about the event. The backsearch should retrieve all the tweets about the event except those that have been filtered out by Twitter’s search quality filter. This can pinpoint fairly closely the time at which the event first entered the tweet stream, and delivers a list of users tweeting about the event.

4. A filter can now be constructed and a filter stream set up with keywords, location and users. In addition, using the REST API, all of the tweets from these users can be retrieved, in case some were missed by the other processes. Tweets of users blocked by the quality filter can be retrieved via the REST API if their names were discovered in a previous step.
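
The detection part of this scenario (steps 1 and 2) can be sketched as a simple frequency comparison. To be clear, this is not one of the TDT algorithms from the research literature. It is a toy heuristic of my own, with hypothetical names and thresholds, that flags terms appearing far more often in a recent window than in the baseline.

```python
from collections import Counter
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    # Lowercase, split on whitespace, trim punctuation but keep '#' on hashtags.
    words = (w.strip(".,!?:;\"'()").lower() for w in text.split())
    return [w for w in words if w]


class BurstDetector:
    def __init__(self, min_count: int = 3, ratio: float = 5.0) -> None:
        self.baseline: Counter = Counter()   # number of baseline tweets containing each term
        self.baseline_tweets = 0
        self.min_count = min_count           # ignore terms seen in fewer window tweets
        self.ratio = ratio                   # required window rate / baseline rate

    def learn(self, tweets: Iterable[str]) -> None:
        """Step 1: build the baseline from 'normal' traffic."""
        for text in tweets:
            self.baseline.update(set(tokenize(text)))
            self.baseline_tweets += 1

    def bursting_terms(self, window: List[str]) -> List[str]:
        """Step 2: return terms that are unusually frequent in the window."""
        if not window:
            return []
        counts: Counter = Counter()
        for text in window:
            counts.update(set(tokenize(text)))
        hits = []
        for term, n in counts.items():
            if n < self.min_count:
                continue
            # Add-one smoothing gives unseen terms a small non-zero baseline rate.
            base_rate = (self.baseline[term] + 1) / (self.baseline_tweets + 1)
            window_rate = n / len(window)
            if window_rate / base_rate >= self.ratio:
                hits.append(term)
        return sorted(hits)


detector = BurstDetector()
detector.learn(["good morning portland"] * 10 + ["coffee time"] * 10)
window = [
    "ambulance at the smith house #smith",
    "is that an ambulance #smith",
    "ambulance #smith wow",
    "good morning",
]
new_terms = detector.bursting_terms(window)
```

In this toy run `new_terms` contains `#smith` and `ambulance`. A real system would also need time windows, location baselines and much more careful statistics.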

Using these processes, a complete “bird’s eye view” of the event can be constructed. The process can create Twitter lists of the users using the REST API, send alerts, publish an RSS feed about the event, and even do some more sophisticated natural language processing.
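
The bookkeeping behind the backsearch and filter-construction steps is simple enough to sketch. The version below assumes the backsearch results have already been fetched and reduced to dictionaries with `user`, `timestamp` and `text` keys. That shape, and the `summarize_backsearch` function itself, are hypothetical and not part of any Twitter API.

```python
from typing import Dict, List, Optional, Sequence


def summarize_backsearch(
    results: List[Dict[str, object]], terms: Sequence[str]
) -> Optional[Dict[str, object]]:
    """Reduce backsearch results to the event's first sighting and a filter spec."""
    wanted = {t.lower() for t in terms}
    # Keep only results that mention at least one of the detected terms.
    hits = [r for r in results if wanted & set(str(r["text"]).lower().split())]
    if not hits:
        return None
    first = min(hits, key=lambda r: r["timestamp"])
    return {
        "first_seen": first["timestamp"],               # when the event entered the stream
        "first_user": first["user"],
        "keywords": sorted(wanted),                     # keywords for a filter stream
        "users": sorted({str(r["user"]) for r in hits}),  # users to follow or list
    }


results = [
    {"user": "dana", "timestamp": 1300.0, "text": "ambulance on elm street"},
    {"user": "eli", "timestamp": 1250.0, "text": "whoa ambulance #smith"},
    {"user": "fay", "timestamp": 1400.0, "text": "traffic is slow today"},
    {"user": "dana", "timestamp": 1500.0, "text": "#smith update"},
]
summary = summarize_backsearch(results, ["ambulance", "#smith"])
```

The resulting `summary` carries the earliest matching timestamp plus the keyword and user lists that a filter stream or a Twitter list could be built from.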

For more information about the streaming API, see http://dev.twitter.com/pages/streaming_api. A detailed description of the mathematics of Topic Detection and Tracking can be found at Semantic Classes in Topic Detection and Tracking.

© 2011 Borasky Research Journal Suffusion theme by Sayontan Sinha