Business Intelligence versus Data Science

Business Intelligence (BI) professionals are beginning to connect with me on LinkedIn, Twitter, and Facebook. As I read their commentary (such as this), I realize how the market has split to support complex, real-time prediction tasks.

When I think about working on real-time prediction tasks, I’m worried about throughput. The easy part of the job is fitting a model. The difficult part of the job is implementing the model in a real-time prediction system. I’ve been working with a few friends on a toy example for explanatory purposes. The toy is an iPhone application that recognizes a wine label (from its picture) and then integrates with Cellartracker to allow insertion/deletion from Cellartracker’s inventory.

The Toy Problem

In this toy problem, an iPhone takes a picture of a wine label (front and/or back) and sends it (using HTTP) to a web server. The web server identifies the full name of the wine by matching the picture against a database using a model that is developed off-line and updated nightly or weekly. For the product managers reading this, note that this process is much more accurate than the Cor.kz iPhone app, which relies on a bar code on each wine label. Bar code use in the wine industry is not reliable for many reasons.
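The upload step can be sketched with nothing but the Python standard library. This is an illustration only, not the server used for the toy: the function names, the port, the JSON shape, and the placeholder classifier (which always returns the same made-up wine name) are all assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_response(classify, image_bytes):
    """Run the classifier callable and encode its answer as JSON bytes."""
    if not image_bytes:
        return 400, json.dumps({"error": "empty upload"}).encode("utf-8")
    return 200, json.dumps({"wine": classify(image_bytes)}).encode("utf-8")


def make_handler(classify):
    """Create a request handler class bound to one classifier callable."""

    class LabelHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # The phone sends the raw picture bytes as the request body.
            length = int(self.headers.get("Content-Length", 0))
            status, body = build_response(classify, self.rfile.read(length))
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return LabelHandler


def placeholder_classifier(image_bytes):
    """Hypothetical stand-in for the off-line model; ignores its input."""
    return "Example Estate Cabernet Sauvignon 2007"


def make_server(port=8080, classify=placeholder_classifier):
    """Build (but do not start) a local server for experimentation."""
    return HTTPServer(("127.0.0.1", port), make_handler(classify))
```

Calling `make_server().serve_forever()` would start it locally. Keeping the classifier as an injected callable means the nightly or weekly model refresh can swap in a new function without touching the HTTP code.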

Many of you may be reading this and thinking: “How can matching a picture to a wine label database be a toy problem?” First, there are fewer than 300,000 typically used wine labels. That’s a small number to a Data Scientist. Second, the algorithms for matching pictures with class labels (here, a wine name) are a well understood science. There are challenges, which I’ll discuss later. The Data Scientist will usually know this or be able to find it quickly in the literature and get a toy model running quickly. The first toy model that I used for this example came from Andrew Ng’s Stanford Machine Learning course, and it took me a few hours to implement after I obtained the database of wine labels.

Another question that may arise is whether the modeling is complicated at all. It isn’t. In the toy model, the modeling is not complicated enough to show off high-end BI or DS skills.

The toy model matches in the neighborhood of 90% of the wine label pictures to their appropriate wines (when the iPhone pictures are of reasonably high quality). The gain over the Cor.kz application (which uses bar codes) was about 40%. The toy model takes a few hours to train, and it works equally well with either logistic regression or a neural network as the machine learning algorithm. More important is the implementation of the pipeline for examining the pixels in the photo. Analyzing the photo is broken down into a series of tasks, beginning with text/pattern region recognition, followed by text/pattern recognition, followed by wine label recognition. Each task uses machine learning features built from the previous tasks.
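The shape of that three-stage pipeline can be sketched as follows. Everything here is a stand-in: the "photo" is a list of text rows instead of pixels, the stage functions are trivial rules where the real system would use trained classifiers, and the catalog entries are invented names. The only point is that each stage consumes what the previous one produced.

```python
def find_text_regions(photo):
    """Stage 1 stand-in: keep the non-blank rows as candidate text regions."""
    return [row for row in photo if row.strip()]


def recognize_text(regions):
    """Stage 2 stand-in: 'read' each region into lowercase word tokens."""
    return [word for region in regions for word in region.lower().split()]


def recognize_wine(tokens, catalog):
    """Stage 3 stand-in: pick the catalog name sharing the most tokens."""
    token_set = set(tokens)

    def overlap(name):
        return len(token_set & set(name.lower().split()))

    return max(catalog, key=overlap)


def classify_label(photo, catalog):
    """Chain the stages: each one uses features built by the one before."""
    regions = find_text_regions(photo)
    tokens = recognize_text(regions)
    return recognize_wine(tokens, catalog)


# Hypothetical two-entry catalog and a fake "photo" for illustration.
CATALOG = [
    "Example Estate Cabernet Sauvignon 2007",
    "Sample Cellars Pinot Noir 2008",
]
photo = ["", "  SAMPLE CELLARS ", "Pinot Noir", "2008", ""]
print(classify_label(photo, CATALOG))
```

Because the stages are plain functions, any one of them can be retrained or replaced without changing the other two.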

From Toy Problem to Production

This toy problem is difficult to implement on the Internet because the actual compute cost of classifying the iPhone picture of a wine label is large enough that the transaction cost (in both time and compute resources) becomes a problem. Therefore, the key skill required to move from the basic modeling task to production is the ability to build a scalable web application. Some Data Scientists can do this. Most can’t.

One of the reasons that this is a toy problem is that it takes very little effort to get something that works. If you are a Business Intelligence (BI) professional and you can tackle this sort of problem effectively, let me know! The BIs I’ve met so far aren’t readily familiar with how to make this example work, while the DS professionals I interact with regularly know how to make the toy work but struggle with the production system. A DS who can make the system work in real time usually works for Google, Yahoo or Bing and is highly sought after.

Data Science Skills

To further underscore the differentiation in the skills, I can walk through some of the tasks. These skills can be uncovered through interviewing, instead of the typical vacuous questions about permutations and combinations.

At least initially, model fitting is done outside of a production pipeline. It’s an off-line activity. The process begins with sampling. Can I take 10 wine bottles from my wine cellar, take pictures of their labels, put the pictures on a computer and classify them correctly (where the multiclass classifier has one class for each wine in the database)?
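A minimal sketch of that ten-bottle experiment, assuming a one-vs-all logistic regression trained by batch gradient descent (one binary classifier per wine). The feature vectors are synthetic stand-ins for whatever would be extracted from real label photos, and the learning rate, epoch count, and noise level are arbitrary choices for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function, clamped so math.exp cannot overflow."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def score(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_one_vs_all(X, y, n_classes, lr=0.5, epochs=200):
    """Fit one binary logistic regression per class (class k versus the rest)."""
    n, d = len(X), len(X[0])
    models = []
    for k in range(n_classes):
        w, b = [0.0] * d, 0.0
        for _ in range(epochs):
            grad_w, grad_b = [0.0] * d, 0.0
            for xi, yi in zip(X, y):
                err = sigmoid(score(w, b, xi)) - (1.0 if yi == k else 0.0)
                for j in range(d):
                    grad_w[j] += err * xi[j]
                grad_b += err
            w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
            b -= lr * grad_b / n
        models.append((w, b))
    return models


def predict(models, x):
    """Return the class whose classifier gives the highest score."""
    scores = [score(w, b, x) for w, b in models]
    return max(range(len(scores)), key=scores.__getitem__)


def make_toy_photos(n_classes=10, per_class=5, noise=0.2, seed=0):
    """Synthetic features: 10 'bottles', 5 noisy 'photos' of each."""
    rng = random.Random(seed)
    X, y = [], []
    for k in range(n_classes):
        for _ in range(per_class):
            X.append([(2.0 if j == k else 0.0) + rng.gauss(0, noise)
                      for j in range(n_classes)])
            y.append(k)
    return X, y


X, y = make_toy_photos()
models = train_one_vs_all(X, y, n_classes=10)
accuracy = sum(predict(models, xi) == yi for xi, yi in zip(X, y)) / len(y)
print(f"training accuracy on the synthetic set: {accuracy:.2f}")
```

The accuracy printed here is measured on the training data itself, so it only shows that the fitting code runs. It says nothing about how the model would do on new photos.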

During this initial model fitting, experimental design is discussed. So are learning curves, underfitting, overfitting, sampling bias, the bias-variance trade-off, feature generation, and whether more features or more labeled data would help. These concepts should be understood by all Data Scientists. I hope they would be understood by BIs too, but that’s not as clear to me. There are also many topics for further discussion, such as error analysis and reporting.
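One way to ground the "more features or more labeled data" question is a learning curve: train on growing subsets and compare training error with validation error. The sketch below assumes a nearest-centroid classifier and synthetic data purely to keep it short. The helper names and sizes are illustrative, not taken from the toy.

```python
import random


def make_data(n_per_class, n_classes=5, noise=0.8, seed=0):
    """Synthetic labeled points: one noisy cluster per class, shuffled."""
    rng = random.Random(seed)
    data = []
    for k in range(n_classes):
        for _ in range(n_per_class):
            x = [(2.0 if j == k else 0.0) + rng.gauss(0, noise)
                 for j in range(n_classes)]
            data.append((x, k))
    rng.shuffle(data)
    return data


def fit_centroids(train):
    """Average the feature vectors of each class seen in the training set."""
    sums, counts = {}, {}
    for x, k in train:
        acc = sums.setdefault(k, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[k] = counts.get(k, 0) + 1
    return {k: [v / counts[k] for v in acc] for k, acc in sums.items()}


def predict(centroids, x):
    """Assign x to the class with the nearest centroid (squared distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda k: sq_dist(centroids[k]))


def error_rate(centroids, data):
    return sum(predict(centroids, x) != k for x, k in data) / len(data)


def learning_curve(train, val, sizes):
    """Return (size, training error, validation error) for each subset size."""
    rows = []
    for m in sizes:
        centroids = fit_centroids(train[:m])
        rows.append((m, error_rate(centroids, train[:m]),
                     error_rate(centroids, val)))
    return rows


data = make_data(n_per_class=60)
train, val = data[:200], data[200:]
curve = learning_curve(train, val, sizes=[5, 10, 25, 50, 100, 200])
for m, train_err, val_err in curve:
    print(f"m={m:>3}  train error={train_err:.2f}  validation error={val_err:.2f}")
```

As a rule of thumb, two curves that converge at a high error suggest underfitting, where better features help more than more data. A persistent gap between low training error and higher validation error suggests overfitting, where more labeled data is more likely to help.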

Model fitting is only one part of this problem. I actually view it as the easy part. It tells me whether it is feasible to solve the business problem (predicting the wine label). It can be solved with Octave, Matlab, R, Perl, Python, Java, C, C#, C++, Fortran, or most other programming languages. I personally find that Octave and Matlab are the fastest prototyping languages for this because they economize on code and yet deliver fast performance. But I’m sure many tools would work adequately. Over time, I would seek the prototyping tools that minimize my costs for playing with the data. In some big data problems, Hadoop and Pig have some usefulness here, but unless someone else is maintaining their runtime environment, they are usually overkill for prototyping.

Once a good discussion of model fitting is finished, the next discussion is about implementation challenges in a production environment. This discussion can go in a lot of directions, but the fundamental problem is user response time. The iPhone app needs to submit the picture to the web server and get an answer fairly quickly. How will this be implemented?

A good implementation discussion will talk about the approach to optimizing the pipeline to achieve the user requirements for real-time response. A great DS should discuss measuring the bottlenecks in the pipeline and different methods to optimize them. For instance, they might compare the performance implications of using multiple logistic regression classifiers versus a conditional random field for text region classification.
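Measuring the bottlenecks can start as simply as timing each stage of the pipeline. In this sketch the stage names mirror the three tasks described earlier, but the stage bodies are hypothetical stand-ins that just sleep for a fixed time, so the numbers mean nothing beyond showing the measurement pattern.

```python
import time


def timed_pipeline(stages, payload):
    """Run stages in order, recording wall-clock seconds spent in each."""
    timings = {}
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings


def slowest_stage(timings):
    """Name of the stage that took the longest: the first thing to optimize."""
    return max(timings, key=timings.get)


def fake_stage(label, seconds):
    """Build a stand-in stage that sleeps, then appends its label to the payload."""
    def stage(payload):
        time.sleep(seconds)
        return payload + [label]
    return stage


# Hypothetical costs; a real profile would come from real classifiers.
STAGES = [
    ("region_recognition", fake_stage("regions", 0.01)),
    ("text_recognition", fake_stage("text", 0.03)),
    ("wine_recognition", fake_stage("wine", 0.005)),
]

result, timings = timed_pipeline(STAGES, [])
for name, seconds in timings.items():
    print(f"{name:<20}{seconds * 1000:8.1f} ms")
print("bottleneck:", slowest_stage(timings))
```

Per-stage timings like these are what make a comparison between, say, several logistic regression classifiers and a conditional random field concrete: swap the stage, rerun, and compare.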

Reporting and Dashboards

Finally, both a BI and a DS should be able to build a simple dashboard to communicate how well the process is working. They should be able to discuss key performance indicators (KPIs) and a method of quickly communicating the status of the KPIs.
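As a rough illustration, a text-only dashboard for this system might track the match rate and response-time percentiles. The request log below is invented sample data, and the KPI names and target thresholds are assumptions chosen only to show the mechanics.

```python
import math
import statistics


def summarize(requests):
    """Compute a few KPIs from a list of request records."""
    latencies = sorted(r["latency_ms"] for r in requests)
    matched = sum(1 for r in requests if r["matched"])
    p95_index = math.ceil(0.95 * len(latencies)) - 1  # nearest-rank percentile
    return {
        "requests": len(requests),
        "match_rate": matched / len(requests),
        "median_latency_ms": statistics.median(latencies),
        "p95_latency_ms": latencies[p95_index],
    }


def render(kpis, targets):
    """Format each KPI on one line, flagging those that miss their target."""
    lines = []
    for name, value in kpis.items():
        status = ""
        if name in targets:
            direction, limit = targets[name]
            ok = value >= limit if direction == "min" else value <= limit
            status = "OK" if ok else "ALERT"
        lines.append(f"{name:<20}{value:>10.2f}  {status}")
    return "\n".join(lines)


# Made-up log: ten requests, two of which failed to match a wine.
LOG = [{"latency_ms": 100 * (i + 1), "matched": i not in (3, 7)}
       for i in range(10)]
# Hypothetical targets: at least 85% matched, p95 under 2 seconds.
TARGETS = {"match_rate": ("min", 0.85), "p95_latency_ms": ("max", 2000)}

kpis = summarize(LOG)
dashboard = render(kpis, TARGETS)
print(dashboard)
```

The OK/ALERT column is the "quickly communicating the status" part: a reader should not need to remember the targets to see which KPI needs attention.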

Conclusion

I see a stark difference between BI and DS. When I watch BI professionals, I hear them discuss analysis and model fitting. When Google or Bing discuss DS, they sound much more like software engineers than BIs. Their organizations prefer to hire implementers, but these skills are harder to obtain. There are many non-CS professionals who work in BI and have great modeling skills but would struggle to implement the “toy example” that I’ve outlined today. However, having great CS implementation skills does not necessarily mean that the implementer draws on the best of the available past work from social science’s rich modeling history. Newer graduates from interdisciplinary academic programs may eventually have both, but these people are still rare.
