I had an interesting chat with a friend who’s an OR/Analytics consultant at one of the poster sessions. An offhand comment was made about the slowness of extracting data from a database so that it can be used to build a predictive model. I jumped in with a few thoughts on analytical database technologies that were happily received as valuable. The friend realized that he, with an OR background, was not as well prepared to think through the full lifecycle of data in a system as I am, with a computer science/machine learning background. (He’s much better at many other things, though!)
So I wanted to take this soapbox to mention a few things about databases that might be new to folks used to only working with data once it’s been collected and the analytics and optimization questions have been posed.
First, there was at least a little bit of buzz at this conference about Hadoop. Hadoop is a way to manage data and perform computations when the data is distributed across hundreds to tens of thousands of machines in a cluster. It’s great, and it has revolutionized analysis of “web scale” data. But hardly anyone at this conference (except Twitter!) has web-scale data. You should use Hadoop when it’s prohibitively expensive to copy your data even once; otherwise, it’s probably overkill. Some folks at Microsoft wrote a great paper a few years ago called “Nobody ever got fired for using Hadoop on a cluster.” If you’re tempted, read that first.
But, you’re saying, my MySQL or Oracle or whatever system is deathly slow to perform analytic queries on. Yep, it sure is. But there are alternative SQL databases that are faster for the type of ad hoc aggregations and queries that OR/Analytics people typically perform. Remember that traditional relational databases are designed and optimized for very fast reads and writes to small numbers of rows at a time (transactional operations). Data is stored row-wise, so you have to read the whole data set to get access to a single column.
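To make “ad hoc aggregation” concrete, here’s the shape of query I have in mind. This is sketched in Python against SQLite (which ships with the standard library) purely for illustration; the `orders` table and its columns are made up.

```python
import sqlite3

# Hypothetical "orders" table, standing in for a big fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("east", "widget", 10.0),
     ("east", "gadget", 5.0),
     ("west", "widget", 7.5)],
)

# A typical ad hoc aggregation: total sales by region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 15.0), ('west', 7.5)]
```

The point is the access pattern: the query touches only two of the three columns, but a row-oriented engine still has to read every byte of every row to answer it.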
The alternative, which has been around for a while but has seen a recent resurgence, is columnar databases. In a columnar database, each column in a table is stored (more or less) independently. This means that if you only need a small number of columns, the computer never has to look at the others. Plus, it’s much easier to compress data of the same type, so many fewer bytes (or gigabytes) need to be read off the disk. Just by copying your data into a columnar database, analytical SQL queries will typically run 10 times faster or more, with no additional optimization. They’re probably the best current solution when your data set is structured and in the 100 GB to 10 TB range.
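Here’s a back-of-the-envelope sketch of why those two effects matter, with made-up table dimensions (pure Python; `zlib` stands in for the specialized per-column encodings real columnar engines use):

```python
import zlib

# Toy table: 100,000 rows by 3 columns of 8-byte values.
n_rows, n_cols = 100_000, 3
row_bytes = 8 * n_cols  # one row, stored contiguously

# Row store: summing one column still reads every row in full.
scanned_row_store = n_rows * row_bytes

# Column store: only that one column's bytes come off the disk.
scanned_col_store = n_rows * 8
print(scanned_row_store // scanned_col_store)  # prints 3: 3x less I/O

# On top of that, a column of similar values compresses very well,
# e.g. a low-cardinality "region" column:
region_col = (b"east" * 50_000) + (b"west" * 50_000)
compressed = zlib.compress(region_col)
print(len(region_col), len(compressed))
```

Real engines do much better than this, with run-length and dictionary encodings tuned to each column, but the two effects (reading fewer columns, compressing each one better) are most of the story.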
There are some downsides — joins aren’t as fast, typically, and updating data can be very slow in some cases. It’s not a good option for unstructured data, graph data, or stuff better stored in a NoSQL database. And the enterprise-scale versions aren’t that cheap, although it’s cheap to get started.
The best first thing to try is probably Amazon Redshift. You can spin up a database and import your data very quickly, use standard Postgres SQL (with a few minor changes), and it’s very cheap to try out. The first time your query comes back in seconds rather than an hour, you’ll be hooked. If the approach seems like a good fit, there are many other columnar databases worth a look, each slightly different: HP Vertica (free community edition), Infobright (free community edition), MonetDB (open source), and others.
Final comment: It’s worthwhile learning a bit about databases, even if you have no decision-making authority in your organization and don’t feel like becoming a database administrator (good call). By getting involved early in the data-collection process, when the IT folks are sitting around a table arguing about platform questions, you can get a word in occasionally about the things that matter for analytics: collecting all the data, storing it in a way that’s friendly to later analysis, and so forth.
Thoughts? Leave a comment, or tweet me @harlanh!