znmeb

May 11, 2012
 

Update 2012-05-13 11:38 PDT:

Again, this isn’t news – we’ve heard this before:

“If investment professionals won’t work to correct the flaws of finance, who will?” – Tom Brakke

More from Josh Brown:

And Above the Market:


Update 2012-05-11 20:30 PDT: You just can’t make this stuff up:

“JPMorgan Chase CEO Jamie Dimon says he does not know whether the bank broke any laws in the surprise $2 billion loss by one of its trading groups.”

“Dimon says the bank was ‘sloppy’ and ‘stupid’ in its handling of the trades.”

JPMorgan Chase CEO: I Have No Idea Whether We Broke The Law, Huffington Post


This blog post started out as a review of Josh Brown’s Backstage Wall Street: An Insider’s Guide to Knowing Who to Trust, Who to Run From, and How to Maximize Your Investments. Josh (@reformedbroker) is one of the best financial bloggers / tweeters on the planet. I’m a regular reader of Josh’s blog, so I went by to see what’s new, and discovered this post. Followed by this post. Which led me to this post on Kid Dynamite’s World.

If you follow my Scoop.it profile, you’ll know that I’m curating a topic about Algorithmic Trading and Market Microstructure. I’m trying to capture opinions on both “sides” in the “debate” over whether high-frequency trading is a good thing or not. I haven’t formed an opinion of my own, except to note that the mathematics are sufficiently complex and the stakes sufficiently high that accountability and transparency are vital. That to me is what computational journalism is about – opening up the “black boxes”, or, as the late Mike Wallace put it, “Comforting the afflicted and afflicting the comfortable.”

So when the story broke yesterday that JPMorgan Chase had disclosed $2 billion in trading losses, and I started following links from financial bloggers, I wasn’t exactly shocked by what I was reading. Combining the two notable aspects of contemporary finance – algorithmic high-frequency trading and systemically dangerous institutions like JPMorgan Chase – led me to the conclusion that, as my mother would have put it, “Wall Street is too smart for their own damn good.”

The problem, though, is that Wall Street is not only too smart for their own good, they’re too smart for your own good too. When the CEO of JPMorgan Chase says things like, “These were egregious mistakes…They were self-inflicted and this is not how we want to run a business”, I don’t see how he can expect me to trust him.

The problem is that we keep hearing this story again and again. I’ve been interested in computational finance and trading systems since 1982. In that time, we’ve had the Crash of 1987, Long-Term Capital Management, rogue traders too numerous to count, the disasters of 2007 – 2008, millions of MF Global funds stolen from customers and lost in bad derivatives bets, banks raising fees and laying off thousands of customer-facing employees, and now this.  A chief executive of a major financial institution was forced to admit that supposedly “smart people” working for the bank had lost $2 billion trading derivatives, and that he didn’t know whether that was the full extent of the losses!

My mother had another saying: “If what you’re doing isn’t working, stop doing it!” I know Josh will come up with a much better rant on this subject than I can, but if you’re the CEO of a bank and you don’t even know whether what you’re doing is working or not, that’s an entirely different ball game! That’s plain old garden-variety incompetence. As Susan Scott so eloquently put it,

“I don’t know about you, but I have not yet witnessed a spontaneous recovery from incompetence.” – Susan Scott, Fierce Conversations.

Am I saying that Jamie Dimon is incompetent? In a word, “Yes”.

 

May 10, 2012
 

Version 1.2.5 Released

I’ve just pushed all the buttons to release Computational Journalism Server 1.2.5. It’s mostly bug fixes and miscellaneous cleanup changes, but there is one new major option: Apache™ Hadoop™. Right now, all that’s there is a script to download and install the latest stable Hadoop from Apache and run the single-node test script. But it should be enough for developers to start testing the R Hadoop interface routines ‘HadoopStreaming’ and ‘hive’. See Parallel R for some sample code using ‘HadoopStreaming’ and ‘hive’.

To install Hadoop, do the following:

  1. Log into the server as “root”.
  2. Type “cd ~/Computational-Journalism-Server/Hadoop”.
  3. Type “./install-hadoop.bash”.
  4. Type “./test-hadoop.bash”.

This should install Hadoop and run the single-node test. For more information on configuring larger-scale Hadoop clusters, see the main documentation page at http://hadoop.apache.org/common/docs/r1.0.2/. The scripts in this release came from the Single Node Setup page.

Road Map

As a few recent posts on this blog have noted, I’m planning to migrate the platform components of Computational Journalism Server to either CloudFreeStyle or OpenShift or both, to take advantage of their existing platform-level components and community support structures. I don’t have a good estimate of dates yet, but there will be at least one more release as an openSUSE appliance before there’s any OpenShift or CloudFreeStyle release. There will also be one more release of Data Journalism Developer Studio 2012LX to catch up to the openSUSE Build Service packages.

May 09, 2012
 

Updated 2012-05-14 13:28 PDT – Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement has gone to print. Congratulations to Eric Redmond and Jim R. Wilson!


As I’ve noted before, I believe any discipline is defined by its practitioners and the tools they use, and computational journalism is no exception. I think computational journalism is a superset of data journalism, but certainly the terms “data journalism” and “data science” cover the bulk of the tools I have collected and used.

In any case, to do computational journalism, at least some data must be collected, stored, explored, analyzed, cleaned, managed and “governed.” In the past few years, the “traditional” tools for doing this, called relational database management systems (RDBMS), have been supplemented by a new class of tools broadly known as “NoSQL” databases. The name NoSQL comes from the most widely used language for dealing with a traditional RDBMS, SQL.
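For readers who haven’t met SQL, here is a minimal sketch of what “dealing with a traditional RDBMS” looks like, using the SQLite engine that ships with Python’s standard library. The table, column names and figures are made up for illustration; nothing here comes from a real data set.

```python
import sqlite3

# In-memory database; the table and its rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donations (donor TEXT, recipient TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO donations VALUES (?, ?, ?)",
    [
        ("Acme PAC", "Senator A", 5000.0),
        ("Acme PAC", "Senator B", 2500.0),
        ("Widget Corp", "Senator A", 1000.0),
    ],
)

# A declarative SQL query: total received per recipient, largest first.
totals = conn.execute(
    "SELECT recipient, SUM(amount) FROM donations "
    "GROUP BY recipient ORDER BY SUM(amount) DESC"
).fetchall()
conn.close()

print(totals)
```

The point is the shape of the work: data lives in fixed tables with named columns, and you describe the result you want rather than the steps to compute it.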

The NoSQL field is rapidly evolving, but enough knowledge exists to fill several books. The best overview of databases for computational journalists I’ve found comes from a soon-to-be-released work from Eric Redmond and Jim R. Wilson called Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement.

I’ve been working through the book, which has been available for a few months in beta from the publisher, Pragmatic Programmers, LLC, in the course of collecting the tools for Data Journalism Developer Studio 2012LX and Computational Journalism Server. My goal is to have all of the databases available in both appliances, although at the moment only PostgreSQL, MongoDB, CouchDB and Redis are available directly from SUSE Studio.

Seven Databases in Seven Weeks covers, in order:

  • PostgreSQL, a traditional RDBMS,
  • Riak, a key-value database
  • HBase, a columnar database
  • MongoDB, a document-oriented database
  • CouchDB, a document-oriented database,
  • Neo4j, a graph-oriented database, and
  • Redis, a key-value database / data structure server.

All of these databases are open source, and they’re all supported by either a corporate entity, a non-profit foundation, or some combination of the two. The title really should have been “Seven Databases in Seven Weekends”; each database is covered in three-day hands-on sessions and could easily be done as a series of weekend projects. The book is hands-on – you’ll build things with these databases, including a Node.js application combining Redis, CouchDB and Neo4j into an application that provides a “band information service.”

Appendix A contains a pair of tables that give an overview of the distinguishing characteristics of the seven databases. As the authors put it, “Although the tables are not a replacement for a true understanding, they should provide you with an at-a-glance sense of what each database is capable of, where it falls short, and how it fits into the modern database landscape.”

I believe all of these databases have a place in modern computational journalism, as do the other two well-known open source RDBMS tools, MySQL and SQLite. In particular, for spatial / mapping projects, PostgreSQL, SQLite, MongoDB and CouchDB have robust geographic information systems capabilities either built in or available as add-ons.

Riak, HBase, MongoDB and CouchDB all support “big data” applications implemented via MapReduce. MongoDB and CouchDB both store their documents as JavaScript Object Notation (JSON) objects, which is the “native” format for Twitter data. Neo4j, as a graph database, is perfect for storing data about relationships, such as the interconnections between corporate executives and legislators. And because of its speed, Redis can serve as a high-speed pipeline between other components in almost any application architecture.
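To make the MapReduce idea concrete, here is a small sketch in plain Python, with no database involved. It assumes a handful of tweet-like JSON documents whose field names (“user”, “text”, “hashtags”) I invented for the example; real Twitter records and the MapReduce interfaces of the databases above look different.

```python
import json
from collections import defaultdict

# Hypothetical tweet-like documents; the field names are illustrative only.
raw_docs = [
    '{"user": "alice", "text": "budget vote today", "hashtags": ["budget", "senate"]}',
    '{"user": "bob", "text": "senate hearing", "hashtags": ["senate"]}',
    '{"user": "carol", "text": "no tags here", "hashtags": []}',
]


def map_hashtags(doc):
    """Map step: emit one (hashtag, 1) pair per hashtag in a document."""
    for tag in doc["hashtags"]:
        yield tag, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


docs = [json.loads(raw) for raw in raw_docs]
emitted = [pair for doc in docs for pair in map_hashtags(doc)]
hashtag_counts = reduce_counts(emitted)

print(hashtag_counts)
```

Because the map step looks at one document at a time, a database can run it on many machines at once and only bring the emitted pairs together for the reduce step. That is what makes the pattern attractive for “big data.”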

I think NoSQL databases will be the core of computational journalism for the next few years. The RDBMS isn’t going away, of course, but if you limit yourself to “SQL thinking” or even “object-relational models” and “model-view-controller” architectures, there will be applications you can’t build. This book will get you up to speed as fast as you’re willing to go.

May 08, 2012
 

There are quite a few books out now on “data science”. I’ve picked out three that I think are the best places to start for computational journalists. First is Machine Learning for Hackers, by Drew Conway and John Myles White. The authors are frequent contributors to the #rstats hashtag; R is the “native language” for this book. Topics of interest to computational journalists include

  • Email processing using Bayesian classifiers to detect spam,
  • An analysis of “polarization” in roll call votes in the United States Senate, and
  • Building a Twitter follower recommendation system.
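To give a feel for the first topic, here is a toy naive Bayes spam classifier. This is my own sketch in Python using only the standard library, not code from the book (which works in R), and the four-message training set is made up; a real classifier needs a labeled email corpus and proper tokenizing.

```python
import math
from collections import Counter

# Tiny made-up training set, for illustration only.
training = [
    ("win cash now", "spam"),
    ("cheap cash offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

# Count how often each word appears under each label.
word_counts = {"spam": Counter(), "ham": Counter()}
label_counts = Counter()
for text, label in training:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])


def log_score(text, label):
    """Log prior plus summed log likelihoods, with add-one smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(label_counts[label] / sum(label_counts.values()))
    for word in text.split():
        count = word_counts[label][word] + 1
        score += math.log(count / (total_words + len(vocabulary)))
    return score


def classify(text):
    """Return the label with the higher naive Bayes log score."""
    return max(("spam", "ham"), key=lambda label: log_score(text, label))


print(classify("cash offer now"))
print(classify("meeting agenda"))
```

The add-one smoothing keeps a word that was never seen under one label from driving that label’s probability to zero, which matters a great deal with small training sets like this one.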

I’m in the final stages of releasing version 1.2.5 of the Computational Journalism Server, and one of the goals is for the examples in this book to run.

Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites, by Matthew A. Russell, uses the Python language rather than R. This might make it more accessible to computational journalists. The book includes

  • scripts for accessing Facebook, Twitter, LinkedIn and Google+ APIs,
  • an excellent explanation of the basics of statistical natural language processing,
  • tools for building HTML5 / JavaScript visualizations, and
  • tools for exploring microformats.

This is more of a workstation resource than a server resource, so I’d recommend downloading Data Journalism Developer Studio 2012LX to experiment with these tools.

Finally, for those of you working in the area of politics and social media, I highly recommend Social Network Analysis for Startups: Finding connections on the social web by Maksim Tsvetovat and Alexander Kouznetsov. Like the previous book, the “native language” here is Python. Some of the topics and tools are also covered in Mining the Social Web, but there’s more depth here. You’ll also want to review the free webinars from O’Reilly on the subjects in the book.