 

As many of you know, I’ve been doing a lot of research into social media analytics, especially in algorithms for text analysis of Twitter data. My focus has been on what the machine learning people call unsupervised learning. Why? Because I’ve come to the realization that tweets are an evolving language. They’re really a meta-language – a tweet could be about a web page, a blog, a picture, etc.

Tweets often aren’t complete sentences or other linguistic constructs as we know them. Twitter has @replies, hashtags, retweets, “follow friday”, trending topics, link shorteners, and a number of other new linguistic constructs that don’t appear in the natural human languages we use in everyday conversation.
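To make those constructs concrete, here is a minimal Python sketch of pulling the Twitter-specific pieces out of a tweet before any text analysis. The regular expressions are deliberately simplified assumptions, not Twitter's own entity rules, and the function name `tweet_features` is hypothetical.

```python
import re

# Simplified patterns for illustration; Twitter's real entity rules are more involved.
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
URL = re.compile(r"https?://\S+")


def tweet_features(text):
    """Separate Twitter-specific constructs from the ordinary words in a tweet."""
    urls = URL.findall(text)
    stripped = URL.sub(" ", text)          # remove links before looking for words
    mentions = MENTION.findall(stripped)
    hashtags = HASHTAG.findall(stripped)
    is_retweet = stripped.lstrip().upper().startswith("RT ")
    plain = MENTION.sub(" ", HASHTAG.sub(" ", stripped))
    words = [w for w in re.findall(r"[A-Za-z']+", plain) if w.upper() != "RT"]
    return {
        "urls": urls,
        "mentions": mentions,
        "hashtags": hashtags,
        "is_retweet": is_retweet,
        "words": words,
    }


features = tweet_features("RT @znmeb: Learning R for #dataviz http://bit.ly/iaqQ #rstats")
```

Separating the constructs this way lets an unsupervised method treat hashtags, mentions and links as their own features instead of as noise in the word stream.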

There is now a project called Tweak-the-Tweet, TtT for short. TtT is a project of the University of Colorado at Boulder. Here’s the news release: CU Grad Student’s ‘Tweet’ Approach Streamlines Online Communications During Haiti Disaster.

As the story notes, Tweak-the-Tweet is helping Haiti relief efforts by providing a standardized syntax for Twitter communications. I think this is very important. You can think of TtT as the Twitter equivalent of Morse code on the telegraph and ham radio, or the “10-codes” used on police and citizens band radio. It’s a way of conveying a lot of information in a small 140-character space. ReadWriteWeb covered the story here:

Tweak the Tweet: New Twitter Hashtag Syntax for Sharing Information During Catastrophes
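To illustrate why a standardized hashtag syntax matters, here is a minimal Python sketch of how a machine could read such a tweet. The tag names in `FIELD_TAGS` and the sample tweet are illustrative assumptions on my part, not the project's official vocabulary; check the project wiki for the real syntax.

```python
# Hypothetical field tags, for illustration only.
FIELD_TAGS = {"need", "offer", "loc", "contact"}


def parse_structured_tweet(text):
    """Split a tweet into topic hashtags and 'field tag + value' pairs."""
    fields = {}
    topics = []
    current = None  # field tag whose value is currently being collected
    for token in text.split():
        if token.startswith("#") and len(token) > 1:
            tag = token[1:].lower()
            if tag in FIELD_TAGS:
                current = tag
                fields.setdefault(tag, [])
            else:
                topics.append(tag)
                current = None
        elif current is not None:
            fields[current].append(token)
    return {
        "topics": topics,
        "fields": {tag: " ".join(words) for tag, words in fields.items()},
    }


report = parse_structured_tweet(
    "#haiti #need water and tents #loc Port-au-Prince #contact @example_user"
)
```

Because each field starts with a known tag, the parser needs no natural language processing at all, which is the point of agreeing on a syntax up front.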

Here’s the main page for the project: HELPING HAITI: TWEAK the TWEET (TtT). You can follow them on Twitter and visit the project wiki. One of the projects we’ll be working on at this weekend’s CrisisCampPDX is Tweak-the-Tweet, so I’ll be posting an update next week. Meanwhile, I urge anyone who can to help out this worthwhile effort.

 

I’ve just learned that CrisisCamp will be starting up a Portland session this weekend. Here’s the EventBrite listing if you’re in or near Portland, Oregon:

http://crisiscamphaitipdx.eventbrite.com/

What is CrisisCamp? “This Saturday, (and Sunday if there’s interest) CrisisCamp will bring together volunteers to collaborate on technology projects which aim to assist in Haiti’s relief efforts by providing data, information, maps and technical assistance to NGOs, relief agencies and the public.”

Projects include:

Port Au Prince Basemap

We Have, We Need Exchange

Languages and Translation

Mobile Applications 4 Crisis Response

NPR’s Crisis Wiki

Family Reunification Systems

Tweak the Tweet

I’ll be there — I’m hoping to help out with Tweak the Tweet.

Please join me if you can – this event is free and open to the public. You don’t have to be technical to volunteer time.  There will be projects that can be done by anybody who has used Google.

 

As you probably know, there are quite a few tools out there that attempt to “score” Twitter users. I’ve looked at most of them, and I have yet to find one that does everything. But the one that’s the most flexible, customizable and useful to me as a micro-blogger is Twitalyzer 2.0.

Twitalyzer is the brainchild of Eric T. Peterson (@erictpeterson), a noted web analytics expert and author of Web Analytics Demystified: A Marketer’s Guide to Understanding How Your Web Site Affects Your Business. Eric brings a passion for analytics and an understanding of the need for actionable metrics and reports to the Twitter scoring arena, something I haven’t seen in any other tool.

What’s new in 2.0? Quite a bit. There are more metrics, a 51-page handbook, tools for segmentation of users, benchmarks, goals, sentiment analysis, and, of course, more of the flexible dashboards and reporting that set Twitalyzer 1.0 apart from the other Twitter scoring tools. I counted 15 separate reports, and I probably missed some. You can plot trends for 22 separate metrics over time.

The two things I liked the most about Twitalyzer 1.0 were:

  1. All of the metrics were defined. You could see what was being counted and what those counts meant.
  2. There were clear recommendations on how to improve your scores.

Twitalyzer 2.0 has kept that. There are many more metrics, but they are still all defined. And the recommendations are still there, along with a new “Goals” report that allows you to set goals and track your progress towards them.

But in my view, the most important new feature of Twitalyzer 2.0 is the Segmentation / Tagging functionality. I’m still learning how to use this, but the examples in the handbook are very well written, and it’s clearly a vital part of any analytics tool set.

How does Twitalyzer compare with the other Twitter scoring tools? There are two others I’ve used in depth, TwitterGrader and Klout. TwitterGrader reports only a single score, and there is no definition of how that score is derived or what actions one should take to improve it. Klout has a few reports, a number of metrics and recommendations for how to improve them, but the Klout reports seem to be full of old data, and it can take hours for them to update your results. And I didn’t see anything like Twitalyzer’s segmentation capability.

There are a few things that could be improved.

  1. Location: Twitalyzer maintains separate lists for all “spellings” of a locality. For example, there are separate lists for “Portland, OR”, “Portland, Oregon” and “Portland, Oregon, USA”. Twitalyzer isn’t the only tool that suffers from this – TwitterGrader does too, and many tools don’t do location-based analytics at all. But it would be fairly easy to combine most of the spellings and misspellings of a given metropolitan area like Portland / Vancouver into a single location, using a combination of Twitter Search and the Google Maps Geocoding API.
  2. CSV export of metrics time series: Twitalyzer can export a single time series to CSV format now in the “Trends” menu. But there are 22 or so metrics; a combined CSV file of all of them would be very useful, especially for someone like me who wants to correlate Twitter metrics with other metrics, campaigns, events, and so on.
  3. I’d like to be able to integrate Twitalyzer data with the Clicky web analytics tools. There is Google Analytics integration now, but I’m not sure I’m going to stay with Google Analytics, even though it’s free and an “industry standard.” Clicky is real-time; Google Analytics isn’t.
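The location issue in item 1 can be sketched in a few lines of Python. This is a minimal string-normalization approach under stated assumptions: the lookup tables are tiny illustrative stand-ins, and the function names are hypothetical. It does not call Twitter Search or the Google Maps Geocoding API, which would be the more robust route for misspellings and metro areas.

```python
from collections import defaultdict

# Tiny illustrative tables; a real tool would need full coverage or a geocoder.
STATE_NAMES = {"or": "oregon", "wa": "washington"}
COUNTRY_SUFFIXES = {"usa", "us", "united states"}


def canonical_location(raw):
    """Reduce different spellings of a locality to one lowercase key."""
    parts = [p.strip().lower() for p in raw.split(",") if p.strip()]
    if len(parts) > 1 and parts[-1] in COUNTRY_SUFFIXES:
        parts = parts[:-1]  # drop a trailing country name
    # Expand state abbreviations, but never touch the city name itself.
    parts = parts[:1] + [STATE_NAMES.get(p, p) for p in parts[1:]]
    return ", ".join(parts)


def group_locations(raw_locations):
    """Map each canonical key to the raw spellings that collapse into it."""
    groups = defaultdict(list)
    for raw in raw_locations:
        groups[canonical_location(raw)].append(raw)
    return dict(groups)


groups = group_locations(["Portland, OR", "Portland, Oregon", "Portland, Oregon, USA"])
```

With this, the three Portland lists from the example collapse into a single key, so their users could be ranked together.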
 

As you probably recall, Twitter released the Streaming API into production on January 5, 2010. My in-depth analysis is here. Six days later, Twitter made the following announcement:

Search API: High-Volume and Repeated Queries Should Migrate to Streaming API

What does this mean? Well, if you’re a vendor of Twitter monitoring tools that depend on Twitter Search for your input data, you should be working on this now. Why?

As I note in my analysis of the streaming API, there is an extra relevance and ranking filter between the raw tweets coming into Twitter and the tweets that get indexed for Twitter Search. As a result of this extra filter, there are more tweets available to users of the Streaming API than there are to users of the Search API.

I haven’t found any details on the current relevance and ranking filters, where they are going, or how fast they’re planning to get where they’re going. But the message is clear:

“This transition begins a fundamental shift towards a high value, high result quality, lower query volume Search API.” And in a thread on the Twitter API Developers’ Google Group, the author of that announcement, John Kalucki, added, “Both Search and Streaming discard all statuses from low-quality users. Search additionally filters the remaining statuses for relevance and ranking purposes. This may be hard to see now, unless you cross-reference the Streaming results, but this divergence will soon accelerate and become more obvious.”

I’m already seeing some evidence of this on the Twitter Developers’ Google Group. Specifically, see this thread. The message to us developers is, “Get serious with the Streaming API.” But what about end users?

Certainly, if you’re using monitoring tools that depend on Twitter Search, you should be asking the vendors about this. But if you’re just using Twitter Search for your own purposes, I think this is very good news. Personally, I recommend using Advanced Twitter Search at http://search.twitter.com/advanced. Again, from the announcement:

“Shifting the heaviest users away from Search should dramatically improve the overall Search experience. Resources can be allocated to the search architecture’s strength: historical, complex and high value queries.”

 

Actually, now that I think about it, the birds in the PDX area don’t fly South for the winter. If the weather gets really bad, they just get a motel on the Oregon Coast. But seriously, folks, as you may know, I had four WordPress blogs, a Blogger blog, a Posterous blog and also have a LinkedIn page and a Facebook page. And I tweet. A lot. More than any other Portlander, I think.

Something had to give – and it was five of the six blogs. I’ve left the Posterous blog on line – I’m still moving posts from it to this one. And I’ve left http://borasky-research.net/smart-at-znmeb on line, because there are still some links to it out there. But I’ve imported all of its posts, comments and pages here.

I’ve imported the Blogger blog here and deleted it. And I’ve imported the Linux Capacity Planning blog and the AlgoCompSynth blog posts, comments and pages here as well and deleted them. So this is it – my only blog! Accept no substitutes!

Other changes:

  1. I’ve switched from the Carrington theme to the Atahualpa theme. I went through a lot of themes in the process, and in the end, it was the fact that I could get a three-column blog easily and the nature images at the top that led me to this one. It’s amazingly flexible and powerful.
  2. I’ve integrated IntenseDebate commenting. This means you can log in with openID, Twitter or Facebook to comment, and your comments everywhere in the IntenseDebate world can be synchronized. I didn’t really do any investigation of IntenseDebate vs. Disqus as a comment management platform – there’s a popular WordPress plugin for IntenseDebate, so I went with it.
  3. Each post has buttons at the bottom so you can post it on Twitter, Facebook, Delicious, Reddit, Digg, StumbleUpon and a few other places. Each post also has a TweetMeme retweet button.
 

Data Journalism Developer Studio 2012 Overview

Download Data Journalism Developer Studio 2012 From SUSE Gallery

Data Journalism Developer Studio on Github

Data Journalism Developer Studio 2012 Blog


The R programming language was featured about a year ago in a New York Times article (http://bit.ly/iaqQ). I’ve been an R user since 2000, so I’ve collected some resources for people who want to get started with R.

The first place to start is the R Project web site at http://www.r-project.org/. Next, you’ll actually want to install R itself. There are several options, depending on your environment.

  • Linux
    • Using your distro’s native packages. Most Linux distros either have R available in the base repositories or have it available from external repositories. The advantage of this is that it will be integrated with your package management system. The disadvantages are that you may not get the latest version of R, and there is no uniformity between distros about how R itself is named or how many R libraries are packaged.
    • Download a package from the Comprehensive R Archive Network (CRAN). Select a mirror at http://cran.r-project.org/mirrors.html. Then follow the “Linux” link at the top. That will give you packages for Ubuntu, Debian, Suse and Red Hat. Red Hat includes Red Hat Enterprise 4 and 5 plus Fedora. Suse includes both the SUSE Linux Enterprise and openSUSE versions.
    • Build from source. Instructions for doing this are at http://cran.fhcrc.org/doc/manuals/R-admin.html
  • Windows or MacOS X
    • Select a mirror at http://cran.r-project.org/mirrors.html.
    • Follow the Windows or MacOS X link in the top panel, just under the Linux link.
      • On Windows, follow the “base” link and download “R-2.10.1-win32.exe”. It’s a standard Windows installer, which you just run.
      • On MacOS X, download and install “R-2.10.1.dmg”

I usually build R from source on my Linux machines. Once you’ve got R installed, you should have most of the documentation. But everything is also available on line at http://cran.r-project.org/manuals.html. You’ll definitely want to read the Introduction at http://cran.r-project.org/doc/manuals/R-intro.html and the FAQ at http://cran.r-project.org/faqs.html.

Here are a few books on R and statistics / data visualization:

Data Visualization and R Programming Books

 

Melissa Jun Rowley posted a marvelous piece here on Posterous: The Tale of Three Great Cities. I’m guessing all of us have a similar tale; here’s mine.

Las Vegas, Nevada

“What happens here, stays here.” Fortunately, I left before that immortal marketing campaign. I was back for a one-day visit in December of 2008, and it was like being transported to another planet. The Strip now looks like a mashup of Times Square, Disneyland and miniature versions of every famous city in the world. Where was my beloved desert? Did I really go to graduate school here? In Mathematics? That’s what the diploma says.

I hear they’re bringing back that marketing campaign. Las Vegas to revive ‘What happens here’ advertising slogan. I’ll pass.

Baltimore – Washington – Annapolis

Three airports, a seaport, two major symphony orchestras, a haven for sailors, spices, seafood, working rail travel and public transit, a restaurant for every nation in the world but one, and one of the lowest unemployment rates in the country. I wasn’t born there, but I was conceived there. It’s really a great place to live – is it any wonder the Obama family chose it as their home? It’s a great place to live, but what about to visit?

It’s been a tad over ten years since the last time I visited, and I don’t really have a reason to go back. The jazz club in Annapolis where I heard Ethel Ennis and Charlie Byrd is now a Starbucks. Historic Inns of Annapolis. Well, maybe they play some jazz CDs. No thanks.

Portland, Oregon

You’ve probably figured out by now that this blog post is a thinly-veiled advertisement for my home town, designed to promote tourism. But the thing is, I don’t actually live in Portland, but in a suburb called Aloha that, strangely enough, has nothing to do with Hawaii. And I’ve never been to Hawaii, so I can’t very well ask you to visit there, can I? So, yes, definitely visit Portland!

I’ve been here for almost 25 years. What do we have?

  1. Water. Two major rivers meet here, and fresh water literally falls out of the sky free for the taking! If you like it salty, there are a few bays and coves a couple of hours to the West.
  2. Air. We get our air mostly fresh off the ocean, or occasionally funneled through the Columbia Gorge by a high-pressure cell. In any event, we get it before much of the US, and we try our damnedest not to add stuff to it on its way East.
  3. Mountains. Yeah, there’s one not too far away that gave us a little trouble in 1980, but for the most part, they’re pretty to look at and a great place to go skiing.
  4. Parks. There are so many, I can’t list them all, so I’ll just give you a link to my favorite. Tryon Creek State Park.
  5. Beer. Contrary to popular belief, you can get imported beer here. But why would you? Ours has better hops, has more alcohol, is served in pubs, restaurants, banquet halls and even movie theaters!
  6. Food and wine. We grow it. We catch it in the ocean. We make it. We cook it. We eat it. We package it up and ship it. And we love to share it. Our food cart scene has been featured on national television and in the New York Times.
  7. Entertainment. New York has Greenwich Village. Washington has Georgetown. Portland has Portland! Jazz, folk, rock, symphonic, chamber, ballet, opera, and two new music ensembles. Portland has numerous theater companies and a major performing arts center. We have a listener-supported classical radio station that’s heard around the world on the Internet. Oh, yeah – if you happen to hear bagpipes, they just might be coming from a unicyclist.
  8. Bloggers and Tweeters and Geeks, Oh! My! I’m a blogger. This is one of my blogs. I have five others. And a LinkedIn page. And a Facebook page. And I tweet. A lot – at last count more than any other Portlander. Geeks: we have Linus Torvalds. Perhaps you’ve heard of Linux? He invented it. We have major contributors to Perl, PostgreSQL, Ruby, WordPress and other open source projects. We have Jive Software and Zapproved. We have the Silicon Florist. We have 30 Hour Day. We have Strange Love Live. We love social media, software and (wait for it) social media software! Software is a craft here, just like belts, jewelry and beer. You can actually sit and watch us make it in coffee shops and pubs.

So if you’re looking for a great city to visit this year, we’re here! Just be careful crossing the street if you hear bagpipes.

 

Over the years, I’ve experimented with a number of tools for mind mapping, and related concepts such as dialogue mapping. I can recommend the following resources.

Coaching and Training

Books

Open Source Software

Other Mind Mapping Links

 



Disclosure

As you probably know, I live in the Portland, Oregon area and have for many years. One of the must-visit places here is Powell’s Books. The book links in this post will all take you to Powell’s as part of their Partner Program. If you’d like to join the program too, here’s the link.

Updated September 7, 2011: I recently purchased a Kindle, and quite a few of these books are now available in that format. For those that are, I’ve tweeted my Amazon Affiliate links out and have embedded those tweets here.


Updated May 18, 2010: There is an interesting discussion happening on LinkedIn about this subject, and quite a few links to data mining and predictive analytics resources have been posted there. So here’s the link. Enjoy! @znmeb


I’m a big fan of the R programming language, especially for data visualization. The following books are all books that I own and recommend.


The Grammar of Graphics (Statistics and Computing) by Leland Wilkinson http://t.co/zEz5210

ggplot2: Elegant Graphics for Data Analysis (Use R) by Hadley Wickham http://t.co/GfXsOJm

Interactive and Dynamic Graphics for Data Analysis: With R and GGobi (Use R) by Dianne Cook http://t.co/TxTU5ov

Graphics of Large Datasets: Visualizing a Million by Antony Unwin http://t.co/8PKQMNI

The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Seco... by Jerome Friedman http://t.co/UjlmX1v

Modern Applied Statistics with S (Statistics and Computing) by W.N. Venables http://t.co/fxEeMPb

S Programming (Statistics and Computing) by William Venables http://t.co/ptFdyms

Software for Data Analysis: Programming with R (Statistics and Computing) by John Chambers http://t.co/N3ka81j

Quantile Regression (Econometric Society Monographs) by Roger Koenker http://t.co/a22nCrG



Finding Groups in Data
by Leonard Kaufman
Powells.com

Programming with Data
by John M. Chambers
Powells.com


 

Data Journalism Developer Studio 2012LX Blog


Topic Detection and Tracking by James Allan is now available for Kindle. Here’s a tweet with my affiliate link so you can purchase it.

Topic Detection and Tracking by James Allan http://t.co/vvQbtwf7

Neal Rauhauser (@StrandedWind) pointed out that the links to Twitter documentation were broken. They should be fixed now. Thanks, Neal!


In all the excitement over the Google Nexus One announcement, another announcement came yesterday that hasn’t received a lot of attention. Twitter announced that the streaming API had graduated to production status. So what is this streaming API?

There are actually three Twitter Application Programming Interfaces (APIs). The original API, sometimes called the REpresentational State Transfer (REST) API, covers the basic Twitter functions. You can read your timeline and direct messages, tweet and retweet from the REST API, follow and unfollow other users, send direct messages (DMs), manipulate your lists, and so on.

The second is the search API. The search API duplicates what the Twitter search page does. It can do everything that the Twitter Advanced Search can do, and nearly all the Twitter monitoring tools depend on this Twitter search API.

The third API is the streaming API, the one that was released into production yesterday. It has been in alpha test since April of 2009. Before describing the “new” API and its significance, let’s look at the basic tweet flow for declarative statements.

Tweet Flow Schematic Diagram

The basic flow of tweets is as follows:

  1. A user publishes a tweet from a mobile, laptop or desktop device, using a Twitter client. The tweet is tagged with a time stamp, the name of the user that sent it, the user it was sent to if it was a reply, and, if the user has enabled geotagging, the user’s location. In other words, the tweet is a statement: “I am here now, and this is what’s happening.” The tweet is either sent to a specific user (an “@reply”) or broadcast to the world.
  2. The tweet enters the Twitter infrastructure. First, it goes into the main database. Once it is in the main database, it is also sent into a user quality filter. The quality filter removes tweets from users that Twitter deems to be of low quality. If the tweet passes the user quality filter, it is forwarded into a second filter, the relevance and ranking filter. If it passes that filter as well, the tweet is sent to the search database, to be indexed for Twitter search. See http://dev.twitter.com/pages/streaming_api_concepts#result-quality and http://help.twitter.com/forums/10713/entries/42646 for more details about the quality filters.

The top arrow on the right represents what you see when you use the REST API. When you access the REST API, you are running a query against the main database. That is, you are creating, reading, updating or deleting tweets from the main database. Like tweets that come from a Twitter client, tweets created in the main database via the REST API are sent to the quality filters.

The second arrow on the right represents the search API. When you access the search API, you are looking up tweets from the past in the search database. This database is read-only; you can’t create, update or delete tweets from it with the search API.

The bottom two arrows represent the streaming API. Like the search API, the streaming API is read-only. Unlike the search API, the streaming API is not buffered through the search database. The raw tweets pass through the same user quality filter as used to qualify tweets for the search database, but they do not go through the relevance and ranking filter. Thus, there are more tweets available to users of the streaming API than there are to users of the search API.
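The relationship between the three APIs can be sketched as a tiny routing function in Python. This is a toy model of the flow described above, not Twitter's implementation: the set of low-quality users and the `is_relevant` predicate are stand-ins for filters whose details Twitter has not published, and `route_tweet` is a hypothetical name.

```python
def route_tweet(tweet, low_quality_users, is_relevant):
    """Return the set of API surfaces through which a tweet is visible."""
    visible = {"rest"}  # every tweet lands in the main database
    if tweet["user"] in low_quality_users:
        return visible  # stopped by the user quality filter
    visible.add("streaming")  # passed the user quality filter
    if is_relevant(tweet):
        visible.add("search")  # also passed relevance and ranking
    return visible


# Stand-in filters for illustration only.
low_quality = {"spammer"}
relevant = lambda t: "haiti" in t["text"].lower()

seen = route_tweet({"user": "alice", "text": "lunch"}, low_quality, relevant)
```

In this model a tweet can only reach search after it has reached streaming, which is why the streaming API always sees at least as many tweets as the search API.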

After the user quality filter, the tweets split into two streams, the sample stream and the filter stream. The sample stream is a subset of the raw public timeline, popularly known as the Firehose. The sampling process serves to limit the processing and bandwidth capacity Twitter must provide to publish the stream, and also the processing and bandwidth capacity a subscriber must provide to accept the stream. A subscriber simply reads the sample stream directly from Twitter.

The filter stream is also a subset of the Firehose. Like the sample stream, a subscriber simply reads the stream directly from Twitter. However, as the name implies, the subscriber can filter the tweets coming out of the stream. The stream can be filtered by keywords, lists of users that created the tweets, or locations of geotagged tweets.
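Here is a minimal Python sketch of what the two streams do, applied to an in-memory list of tweets instead of a live connection. The tweet dictionaries, parameter names and bounding-box layout (west, south, east, north) are assumptions for illustration, and I have assumed that the three kinds of criteria are combined with OR; the real API's request parameters and matching rules are described in Twitter's documentation.

```python
import random


def sample_stream(tweets, rate, seed=None):
    """Yield a random subset of the tweets, keeping roughly `rate` of them."""
    rng = random.Random(seed)
    for tweet in tweets:
        if rng.random() < rate:
            yield tweet


def in_box(point, box):
    """True if a (lon, lat) point lies inside a (west, south, east, north) box."""
    lon, lat = point
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


def filter_stream(tweets, keywords=(), users=(), bounding_boxes=()):
    """Yield tweets matching any keyword, any listed user, or any bounding box."""
    wanted_users = set(users)
    lowered = [k.lower() for k in keywords]
    for tweet in tweets:
        text = tweet["text"].lower()
        point = tweet.get("coordinates")  # None unless the tweet is geotagged
        if (
            any(k in text for k in lowered)
            or tweet["user"] in wanted_users
            or (point is not None and any(in_box(point, b) for b in bounding_boxes))
        ):
            yield tweet


tweets = [
    {"text": "Need WATER in Port-au-Prince", "user": "bob", "coordinates": None},
    {"text": "Lunch", "user": "carol", "coordinates": (-122.68, 45.52)},
]
water_tweets = list(filter_stream(tweets, keywords=["water"]))
```

Both functions are generators, which mirrors how a subscriber consumes a stream: one tweet at a time, with no need to hold the whole timeline in memory.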

Why is this a big deal? There are a number of reasons, but the one that excites me the most is automated real-time journalism, using a technique called Topic Detection and Tracking (TDT). How does this work? Here’s a simple scenario:

1. The topic detection and tracking server connects to the sample stream and monitors tweets as they arrive. The sample stream is a representative sample in near real time of all the tweets being published from all over the world. After a certain period of time, the monitoring process will have established a baseline of common words and hashtags that it sees, and when and where “normal” tweets are published on average.

2. Now suppose an event occurs: for example, an ambulance arrives at the home of a celebrity. A crowd gathers and starts tweeting about it. Because of the sampling, not all of these tweets will show up in the sample stream. But if the event is newsworthy enough, word will spread, the initial reports will get retweeted, and the monitoring process will notice new words, such as the name of the celebrity and perhaps some hashtags that have been created. And enough of the tweets will be geotagged to get a fix on the location of the event, even if nobody names the location in a tweet.

3. When the monitoring process sees an event, it can then initiate a backsearch, using the search API, to collect more details about the event. The backsearch should retrieve all the tweets about the event except those that have been filtered out by Twitter’s search quality filter. This can pinpoint fairly closely the time at which the event first entered the tweet stream, and delivers a list of users tweeting about the event.

4. A filter can now be constructed and a filter stream set up with keywords, location and users. In addition, using the REST API, all of the tweets from these users can be retrieved, in case some were missed by the other processes. Tweets of users blocked by the quality filter can be retrieved via the REST API if their names were discovered in a previous step.

Using these processes, a complete “bird’s eye view” of the event can be constructed. The process can create Twitter lists of the users using the REST API, send alerts, publish an RSS feed about the event, and even do some more sophisticated natural language processing.
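The baseline-and-detection idea in steps 1 and 2 can be sketched in Python as follows. This is a deliberately crude frequency comparison under my own assumptions (whitespace tokenization, arbitrary thresholds, hypothetical function names); real Topic Detection and Tracking systems use considerably more sophisticated statistics.

```python
from collections import Counter


def term_rates(tweets):
    """Fraction of tweets in which each lowercased term appears at least once."""
    counts = Counter()
    for text in tweets:
        counts.update(set(text.lower().split()))
    total = max(len(tweets), 1)
    return {term: n / total for term, n in counts.items()}


def detect_new_terms(baseline_tweets, window_tweets, min_rate=0.2, ratio=5.0, floor=0.01):
    """Flag terms that are common in the current window but rare in the baseline.

    `floor` keeps never-before-seen terms from causing a division by zero.
    """
    base = term_rates(baseline_tweets)
    now = term_rates(window_tweets)
    flagged = [
        term
        for term, rate in now.items()
        if rate >= min_rate and rate / max(base.get(term, 0.0), floor) >= ratio
    ]
    return sorted(flagged)


baseline = ["good morning portland", "coffee time", "rain again in portland", "good coffee"]
window = [
    "ambulance at #celebhouse",
    "#celebhouse ambulance just arrived",
    "good morning",
    "what is happening #celebhouse",
    "coffee time",
]
new_terms = detect_new_terms(baseline, window, min_rate=0.3)
```

The flagged terms would then seed the backsearch in step 3 and the keyword list for the filter stream in step 4.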

For more information about the streaming API, see http://dev.twitter.com/pages/streaming_api. A detailed description of the mathematics of Topic Detection and Tracking can be found at Semantic Classes in Topic Detection and Tracking.

© 2011 Borasky Research Journal Suffusion theme by Sayontan Sinha