The rise of Twitter has made large quantities of text available in multiple languages. As a result, text processing, text analytics and other natural language processing techniques have become staples of business intelligence. So I’ve put together a list of what I think are the essential references in the area, arranged in order of increasing mathematical sophistication. And, as always, I’ve provided Powell’s Partner Program links so you can buy them.
Most of the algorithms described in these books are available in the Data Journalism Developer Studio. For a complete description of the toolkit, see About The Data Journalism Developer Studio.
Even if you’re not a Python programmer, this book is probably the best place to start. The book will walk you through the Python language, and it’s written by the experts on the Python Natural Language Toolkit (NLTK). NLTK is one of the featured components of my Social Media Analytics Research Toolkit.
For Perl programmers, this book is a good place to start. Topics include pattern matching, data structures, probability, information retrieval, corpus linguistics, multivariate statistics, clustering and an introduction to R programming. Both Perl and R are available in the Social Media Analytics Research Toolkit.
After you’ve gotten started, this book will give you a good overview of the more technical and mathematical aspects of natural language processing. Topics include classical approaches, empirical and statistical approaches and applications. Machine translation, speech recognition, information retrieval, question answering, ontology construction and sentiment analysis are all covered.
The chapter on sentiment analysis is particularly well done. It covers most current techniques and includes sections on dealing with spam. Sentiment analysis is still somewhat controversial, although nearly all social media monitoring providers include it in some form. This chapter provides much-needed clarity on just what is and isn’t possible in sentiment analysis. Chapter 13, “Normalized Web Distance and Word Similarity”, is also notable. It describes the algorithms used in the CompLearn suite of programs that are part of the Social Media Analytics Research Toolkit.
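To give a feel for the compression-based similarity idea behind this family of algorithms, here is a minimal sketch of the normalized compression distance, a close relative of the normalized web distance and, as far as I know, the measure CompLearn is built around. It does not use CompLearn itself: zlib stands in for the compressor, and the function names and sample strings are my own assumptions for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (a rough stand-in for its information content)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Made-up sample texts; repeated so the compressor has something to work with.
    a = "the quick brown fox jumps over the lazy dog " * 20
    b = "the quick brown fox leaps over the lazy cat " * 20
    c = "colorless green ideas sleep furiously tonight " * 20
    print("a vs b:", ncd(a, b))
    print("a vs c:", ncd(a, c))
```

Smaller values mean the two texts share more compressible structure. The scores depend heavily on the compressor and on text length, so treat them as relative rankings rather than absolute measurements.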
This isn’t strictly a book about either text processing or natural language processing, but I’ve included it for three reasons:
- It covers all of the matrix decompositions one would use in text processing and natural language processing.
- It covers algorithms for social graph analysis.
- It has a very readable introduction to using tensors – arrays with more than two dimensions – in data mining. My opinion is that tensor-based algorithms are the future of natural language processing in general and text analytics in particular.
While also not totally about text / natural language processing, this book is an excellent overview of the technologies used in counterterrorism. There’s not as much technical detail as there is in the other books – you’ll need to follow the references. I’ve included this book because I see great potential for some of the technologies in business intelligence. For example, mining data for people in a social media site who “should” be friends or followers but aren’t is one technique businesses could “borrow” from law enforcement.
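The “should be friends” idea can be sketched with about the simplest link-prediction heuristic there is: count the neighbors two unconnected people have in common. This is only an illustration under my own assumptions, not a method taken from the book; the function name, the `min_shared` threshold and the toy graph are all hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def suggest_links(friends: Dict[str, Set[str]], min_shared: int = 2) -> List[Tuple[str, str, int]]:
    """Rank pairs of people who are not connected by how many friends they share."""
    suggestions = []
    for a, b in combinations(sorted(friends), 2):
        if b in friends[a] or a in friends[b]:
            continue  # already friends or followers, nothing to suggest
        shared = len(friends[a] & friends[b])
        if shared >= min_shared:
            suggestions.append((a, b, shared))
    # Most shared friends first; ties broken alphabetically for stable output.
    return sorted(suggestions, key=lambda s: (-s[2], s[0], s[1]))


if __name__ == "__main__":
    # Made-up friendship graph, stored as adjacency sets.
    graph = {
        "ann": {"bob", "cat", "dan"},
        "bob": {"ann", "cat"},
        "cat": {"ann", "bob", "eve"},
        "dan": {"ann", "eve"},
        "eve": {"cat", "dan"},
    }
    for a, b, shared in suggest_links(graph):
        print(a, b, shared)
```

Real systems weight the shared neighbors and add many other signals, but the shape of the computation is the same: score every missing edge and look at the top of the list.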
This book is an excellent overview of some of the more recent research in text mining. It includes chapters on “Detection of Bias in Media Outlets with Statistical Learning Methods”, “Topic Models”, “Utility-Based Information Distillation” and “Adaptive Information Filtering”. But in my opinion one chapter, “Nonnegative Matrix and Tensor Factorization for Discussion Tracking”, justifies the purchase of the book on its own.
Much of modern text processing depends on linear algebra over so-called “bag of words” vector space models. In such a model, keywords are extracted from the text, and a collection of documents, called a corpus, is represented by arrays of keyword frequencies. The basic array is a matrix: a two-dimensional array of frequencies, with rows usually indexed by keywords and columns by documents or document authors.
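Here is a minimal sketch of such a keyword-by-document frequency matrix in plain Python. It makes some simplifying assumptions of my own: every lowercase word counts as a keyword (no stop-word removal or stemming), the tokenizing regular expression is crude, and the function name and sample documents are invented for illustration.

```python
import re
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[i] in docs[j]."""
    # One bag of words per document: word order is thrown away, only counts remain.
    counts = [Counter(re.findall(r"[a-z']+", doc.lower())) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    # Rows are keywords, columns are documents; a missing word counts as 0.
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


if __name__ == "__main__":
    corpus = ["Twitter text is text", "Text analytics on Twitter"]
    vocabulary, matrix = term_document_matrix(corpus)
    for term, row in zip(vocabulary, matrix):
        print(term, row)
```

Indexing the columns by author instead of by document only requires concatenating each author's documents before counting.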
Latent Semantic Analysis, sometimes called Latent Semantic Indexing, is a common technique in natural language processing, and this book explores it in both mathematical and practical detail. There is also a chapter on probabilistic topic modeling, one well-known form of which is latent Dirichlet allocation. If you want to experiment with these techniques, I recommend the open source Java-language Mallet package. Mallet is included in the Social Media Analytics Research Toolkit, as are R language tools for latent semantic analysis.
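Latent semantic analysis rests on a truncated singular value decomposition of the keyword-by-document matrix. For real work you would use Mallet, R or another linear algebra library, but the core idea fits in a few lines. The sketch below is a simplification of my own: it finds only the single largest singular value and its vectors by power iteration, and the function name and toy matrix are invented for illustration.

```python
import math


def leading_singular_triple(matrix, iterations=200):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value.

    Assumes a non-empty, non-zero matrix given as a list of equal-length rows.
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols  # deterministic unit-length start vector
    for _ in range(iterations):
        # av = A v, then w = A^T (A v)
        av = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(matrix[i][j] * av[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av]
    return sigma, u, v


if __name__ == "__main__":
    # Made-up counts: rows are keywords, columns are documents.
    counts = [[0, 1], [1, 0], [0, 1], [2, 1], [1, 1]]
    sigma, u, v = leading_singular_triple(counts)
    print("strength of the first latent dimension:", sigma)
    print("keyword loadings:", u)
    print("document coordinates:", [sigma * x for x in v])
```

A full analysis keeps the top k such dimensions rather than one, and compares keywords or documents by their coordinates in that reduced space.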
Finally, we come to my current area of research, Topic Detection and Tracking. This book is the classic reference on the subject, and is required reading if you’re interested in automated journalism.
Appendix – R Language Natural Language Processing Task View
In addition to the Python Natural Language Toolkit and Mallet, the Social Media Analytics Research Toolkit contains the R Natural Language Processing Task View. Here’s a copy of the contents of that task view as of 2010-08-23:
CRAN Task View: Natural Language Processing