Big Data, Sparse Data
It is clear that “big data” has arrived at INFORMS. And I don’t just mean the very entertaining late afternoon panel discussion on Big Data hosted by Thornton May. Big data was lurking in every presentation I went to yesterday. From Scott Nicholson, Data Scientist at LinkedIn, using their massive social network to understand the relationship between different professional skill sets, to Hal Varian of Google using data on search queries to “nowcast” unemployment, to Rex Davis of dunnhumby using their petabyte of loyalty card data to understand which products are substitutes, it is clear that modern analytics projects are built on a foundation of rich and massive data. As someone who was trained in statistics and decision theory, I’ve spent half a career focused on the uncertainty we have when we don’t have enough data. And when I see these amazingly rich data sets, it makes me wonder if my well-honed skills in extracting information from small data sets are still needed. Should I forget confidence intervals and learn how to map reduce?
But beyond the big data hype, I noticed another common thread across the presentations I saw yesterday. As you dig into the mountains and mountains of big data, you often run into unexpected sparsity. As Hal drilled down into the search data, he mentioned that there are search queries that don’t appear very often even in the hundreds of millions of searches that are typed into Google every day. And, despite the power of a petabyte of market basket data, Rex acknowledged that they still don’t know quite what to do with new products that have little market history. So, uncertainty still exists if you know where to look for it.
In fact, characterizing uncertainty may be more important than ever as we face situations where we know some things with near certainty (like the fact that 12 oz Coke is almost always the most frequently purchased SKU in a grocery store) and then run into other situations where we know very little (like how well scrapple, my favorite Philadelphia delicacy, would sell in a Southern California grocery). Big data provides us with the challenge of measuring this uncertainty, incorporating it into prescriptive models and, most importantly, helping managers understand that even with petabytes of data there are still some things we just don’t know.