Topic Detection and Tracking by James Allan is now available for Kindle. Here’s a tweet with my affiliate link so you can purchase it.
Neal Rauhauser (@StrandedWind) pointed out that the links to Twitter documentation were broken. They should be fixed now. Thanks, Neal!
In all the excitement over the Google Nexus One announcement, another announcement came yesterday that hasn’t received a lot of attention. Twitter announced that the streaming API had graduated to production status. So what is this streaming API?
There are actually three Twitter Application Programming Interfaces (APIs). The original API, sometimes called the REpresentational State Transfer (REST) API, covers the basic Twitter functions. From the REST API you can read your timeline and direct messages, tweet and retweet, follow and unfollow other users, send direct messages (DMs), manipulate your lists, and so on.
The second is the search API. The search API duplicates what the Twitter search page does. It can do everything that the Twitter Advanced Search can do, and nearly all the Twitter monitoring tools depend on this Twitter search API.
The third API is the streaming API, the one that was released into production yesterday. It had been in alpha testing since April 2009. Before describing the “new” API and its significance, let’s look at the basic flow of a tweet, treating each tweet as a declarative statement.

[Figure: Tweet Flow Schematic Diagram]
The basic flow of tweets is as follows:
- A user publishes a tweet from a mobile, laptop or desktop device, using a Twitter client. The tweet is tagged with a time stamp, the name of the user that sent it, the user it was sent to if it was a reply, and, if the user has enabled geotagging, the user’s location. In other words, the tweet is a statement: “I am here now, and this is what’s happening.” The tweet is either sent to a specific user (an “@reply”) or broadcast to the world.
- The tweet enters the Twitter infrastructure. First, it goes into the main database. Once it is in the main database, it is also sent into a user quality filter, which is designed to screen out tweets from users that Twitter deems to be of low quality. If the tweet passes the user quality filter, it is forwarded into a second filter, the relevance and ranking filter. If it passes this filter as well, the tweet is sent to the search database, to be indexed for Twitter search. See http://dev.twitter.com/pages/streaming_api_concepts#result-quality and http://help.twitter.com/forums/10713/entries/42646 for more details about the quality filters.
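To make the flow concrete, here is a toy model of the two steps above. It is only a sketch: the class name `TwitterFlow` and the two filter predicates are my own placeholders, and Twitter's real filters are certainly more involved than a one-line lambda.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Tweet:
    user: str
    text: str


@dataclass
class TwitterFlow:
    """Toy model of the tweet flow described above (not Twitter's real code)."""
    user_quality: Callable[[Tweet], bool]  # hypothetical stand-in for the user quality filter
    relevance: Callable[[Tweet], bool]     # hypothetical stand-in for the relevance and ranking filter
    main_db: List[Tweet] = field(default_factory=list)
    search_db: List[Tweet] = field(default_factory=list)

    def publish(self, tweet: Tweet) -> None:
        # Every tweet lands in the main database first.
        self.main_db.append(tweet)
        # Only tweets that pass both filters are indexed for search.
        if self.user_quality(tweet) and self.relevance(tweet):
            self.search_db.append(tweet)


# Made-up filter rules, purely for illustration.
flow = TwitterFlow(
    user_quality=lambda t: t.user != "spammer",
    relevance=lambda t: len(t.text) > 3,
)
flow.publish(Tweet("alice", "Ambulance outside the studio"))
flow.publish(Tweet("spammer", "Buy followers now"))
flow.publish(Tweet("bob", "hi"))

print(len(flow.main_db), "tweets in the main database")
print([t.user for t in flow.search_db], "made it into the search database")
```

All three tweets are stored in the main database, but only the one that passes both placeholder filters is indexed for search.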
The top arrow on the right represents what you see when you use the REST API. When you access the REST API, you are running a query against the main database. That is, you are creating, reading, updating or deleting tweets from the main database. Like tweets that come from a Twitter client, tweets created in the main database via the REST API are sent to the quality filters.
The second arrow on the right represents the search API. When you access the search API, you are looking up tweets from the past in the search database. This database is read-only; you can’t create, update or delete tweets from it with the search API.
The bottom two arrows represent the streaming API. Like the search API, the streaming API is read-only. Unlike the search API, the streaming API is not buffered through the search database. The raw tweets pass through the same user quality filter as used to qualify tweets for the search database, but they do not go through the relevance and ranking filter. Thus, there are more tweets available to users of the streaming API than there are to users of the search API.
The tweets that pass the user quality filter then split into two streams, the sample stream and the filter stream. The sample stream is a subset of the Firehose, the popular name for the raw public timeline. The sampling process serves to limit the processing and bandwidth capacity Twitter must provide to publish the stream, and also the processing and bandwidth capacity a subscriber must provide to accept the stream. A subscriber simply reads the sample stream directly from Twitter.
The filter stream is also a subset of the Firehose. Like the sample stream, a subscriber simply reads the stream directly from Twitter. However, as the name implies, the subscriber can filter the tweets coming out of the stream. The stream can be filtered by keywords, lists of users that created the tweets, or locations of geotagged tweets.
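The three kinds of filter criteria can be pictured as a simple predicate applied to each tweet. The sketch below is a local model only and never contacts Twitter. The names `StreamFilter` and `matches` are hypothetical, locations are modeled as bounding boxes, and I am assuming a tweet is delivered if it matches any one of the criteria.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# (west, south, east, north) bounding box in degrees; an assumed representation
Box = Tuple[float, float, float, float]


@dataclass
class Tweet:
    user: str
    text: str
    coords: Optional[Tuple[float, float]] = None  # (longitude, latitude) if geotagged


@dataclass
class StreamFilter:
    keywords: Set[str]  # lowercase keywords
    users: Set[str]     # authors to follow
    boxes: List[Box]    # areas of interest

    def matches(self, tweet: Tweet) -> bool:
        # Keyword match: compare lowercase word tokens against the keyword set.
        words = set(re.findall(r"#?\w+", tweet.text.lower()))
        if self.keywords & words:
            return True
        # User match: the tweet was written by a followed user.
        if tweet.user in self.users:
            return True
        # Location match: a geotagged tweet falls inside any bounding box.
        if tweet.coords is not None:
            lon, lat = tweet.coords
            return any(w <= lon <= e and s <= lat <= n
                       for w, s, e, n in self.boxes)
        return False


f = StreamFilter(
    keywords={"ambulance"},
    users={"alice"},
    boxes=[(-118.7, 33.7, -118.1, 34.4)],  # rough box, illustrative values
)
incoming = [
    Tweet("bob", "Ambulance just pulled up!"),
    Tweet("alice", "Heading home"),
    Tweet("carol", "Big crowd here", (-118.4, 34.1)),
    Tweet("dave", "Nice weather", (2.35, 48.85)),
]
delivered = [t.user for t in incoming if f.matches(t)]
print(delivered)
```

In this example the first three tweets match on keyword, user and location respectively, and the last one matches nothing.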
Why is this a big deal? There are a number of reasons, but the one that excites me the most is automated real-time journalism, using a technique called Topic Detection and Tracking (TDT). How does this work? Here’s a simple scenario:
1. The topic detection and tracking server connects to the sample stream and monitors tweets as they arrive. The sample stream is a representative sample in near real time of all the tweets being published from all over the world. After a certain period of time, the monitoring process will have established a baseline of common words and hashtags that it sees, and when and where “normal” tweets are published on average.
2. Now suppose an event occurs: for example, an ambulance arrives at the home of a celebrity. A crowd gathers and starts tweeting about it. Because of the sampling, not all of these tweets will show up in the sample stream. But if the event is newsworthy enough, word will spread, the initial reports will get retweeted, and the monitoring process will notice new words, such as the name of the celebrity and perhaps some hashtags that have been created. And enough of the tweets will be geotagged to get a fix on the location of the event, even if nobody mentions the location explicitly.
3. When the monitoring process sees an event, it can then initiate a backsearch, using the search API, to collect more details about the event. The backsearch should retrieve all the tweets about the event except those that have been filtered out by Twitter’s search quality filter. This can pinpoint fairly closely the time at which the event first entered the tweet stream, and it also yields a list of users tweeting about the event.
4. A filter can now be constructed and a filter stream set up with keywords, location and users. In addition, using the REST API, all of the tweets from these users can be retrieved, in case some were missed by the other processes. Tweets of users blocked by the quality filter can be retrieved via the REST API if their names were discovered in a previous step.
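The detection part of this scenario (steps 1 and 2) can be sketched in a few lines. This is not the TDT algorithm from the literature, just a crude frequency comparison I am using for illustration: learn how often each term appears in "normal" tweets, then flag terms that are suddenly far more common in a recent window. The class name `TermMonitor` and the thresholds `min_count` and `ratio` are assumptions of mine.

```python
import re
from collections import Counter
from typing import Iterable, List

TOKEN = re.compile(r"[#@]?\w+")


def tokens(text: str) -> List[str]:
    """Lowercase words, keeping a leading # or @ so hashtags and mentions survive."""
    return TOKEN.findall(text.lower())


class TermMonitor:
    def __init__(self, min_count: int = 2, ratio: float = 5.0):
        self.baseline = Counter()   # number of baseline tweets containing each term
        self.baseline_tweets = 0
        self.min_count = min_count  # ignore terms seen fewer times than this in a window
        self.ratio = ratio          # how much more frequent than baseline counts as unusual

    def learn(self, texts: Iterable[str]) -> None:
        """Step 1: build the baseline from 'normal' traffic."""
        for text in texts:
            self.baseline.update(set(tokens(text)))
            self.baseline_tweets += 1

    def unusual_terms(self, window: Iterable[str]) -> List[str]:
        """Step 2: flag terms much more frequent in the window than in the baseline."""
        window = list(window)
        counts = Counter()
        for text in window:
            counts.update(set(tokens(text)))
        flagged = []
        for term, n in counts.items():
            if n < self.min_count:
                continue
            # Add-one smoothing gives unseen terms a small nonzero baseline rate.
            expected = (self.baseline[term] + 1) / (self.baseline_tweets + 1)
            observed = n / len(window)
            if observed / expected >= self.ratio:
                flagged.append(term)
        return sorted(flagged)


monitor = TermMonitor()
monitor.learn([
    "good morning everyone",
    "coffee time this morning",
    "traffic is bad this morning",
    "lunch time",
] * 25)

window = [
    "ambulance at the mansion #celebwatch",
    "ambulance outside mansion, crowd gathering",
    "#celebwatch ambulance just arrived",
    "good morning everyone",
]
print(monitor.unusual_terms(window))  # -> ['#celebwatch', 'ambulance', 'mansion']
```

The flagged terms are the natural seeds for the backsearch in step 3 and for the keyword part of the filter in step 4.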
Using these processes, a complete “bird’s eye view” of the event can be constructed. The process can create Twitter lists of the users using the REST API, send alerts, publish an RSS feed about the event, and even do some more sophisticated natural language processing.
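As one example of the publishing side, here is a minimal sketch of turning the collected tweets into an RSS 2.0 feed with Python's standard library. The function name `build_event_feed`, the feed title and the example.com link are placeholders of mine, and a real feed would probably carry more per-item detail (links, timestamps) than this.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def build_event_feed(title: str, link: str,
                     tweets: Iterable[Tuple[str, str]]) -> str:
    """Return an RSS 2.0 document with one item per (user, text) pair."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = f"Tweets collected about: {title}"
    for user, text in tweets:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = f"@{user}"
        # ElementTree escapes characters such as & and < when serializing.
        ET.SubElement(item, "description").text = text
    return ET.tostring(rss, encoding="unicode")


feed = build_event_feed(
    "Demo event",
    "https://example.com/events/demo",  # placeholder URL
    [("alice", "Crowd & police outside"), ("bob", "Ambulance arrived")],
)
print(feed)
```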
For more information about the streaming API, see http://dev.twitter.com/pages/streaming_api. A detailed description of the mathematics of Topic Detection and Tracking can be found in “Semantic Classes in Topic Detection and Tracking.”