Analytics Ascendant, Part 2: The Limits of Predictive Modeling

In my last post on predictive modeling (4 August 2009) I used the recent announcement that the Netflix Prize appears to have been won to make two points. First, predictive modeling based on huge amounts of consumer/customer data is becoming more important and more prevalent throughout business (and other aspects of life as well). Second, the power of predictive modeling to deliver improved results may seduce us into believing that just because we can predict something, we understand it.

Perhaps because it fuses popular culture with predictive modeling, Cinematch (Netflix’s recommendation engine) seemed like a good example to use in making these points. For one thing, if predicting movie viewers’ preferences were easy, the motion picture industry would probably have figured out how to do it at some early stage in the production process–not that they haven’t tried. A recent approach uses neural network modeling to predict box office success from the characteristics of the screenplay (you can read Malcolm Gladwell’s article in The New Yorker titled “The Formula” for a narrative account of this effort). The market is segmented by product differentiation (e.g., genres) as well as …

This brings to mind a paradox of predictive modeling (PM). PM can work pretty well in the aggregate (perhaps allowing Netflix to do a good job of estimating demand for different titles in the backlist) but not so well when it comes to predicting a given individual’s preferences. I tend to be reminded of this every time I look at the list of movies that Cinematch predicts that I’ll love. For each recommended film, there’s a list of one or more other films that form the basis for the recommendation. I’m struck by the often wide disparities between the recommended film and the films that led to the recommendation.

One example: Cinematch recommended “Little Miss Sunshine” (my predicted rating is 4.9, compared to an average of 3.9) because I also liked “There Will Be Blood,” “Fargo,” and “Syriana.” It would be hard to find three films more different from “Little Miss Sunshine.”

“Mostly Martha” is another example. This is a German film in the “foreign romance” genre that was remade as “No Reservations” in the U.S. with Catherine Zeta-Jones. Cinematch based its recommendation on the fact that I liked “The Station Agent.” These two films have almost no objective elements in common. They are in different languages, set in different countries, with very different story lines, cast and so forth. But they share many subjective elements (great acting, characters you care about, and humor, among others) and it’s easy to imagine that someone who likes one of these will enjoy the other. On the other hand, Cinematch made a lot of strange recommendations (such as “Amelie,” a French romantic comedy) based on the fact that I enjoyed “Gandhi,” the Oscar-winning 1982 biopic that starred Ben Kingsley.

Like all predictive models, Cinematch takes the available information and makes a best guess. According to the Netflix Prize FAQs, Cinematch does it with “straightforward statistical models with a lot of data conditioning.” Netflix also uses some additional data sources (beyond the simple ratings).

What the recommendations reveal is that there are enough people in the database who rated pairs of movies that the algorithm can make a reasonable guess about how much I’ll enjoy one movie given that I’ve enjoyed the other. The result is, conceptually at least, a huge hyper-dimensional matrix of conditional probabilities. Cinematch does not provide me with a measure of its confidence in the prediction, but I can infer it by comparing my predicted rating with the average member rating. I’m guessing that the bigger the difference, the more information Cinematch had available.
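To make the idea of a table of conditional probabilities concrete, here is a minimal sketch in Python. It is not Cinematch’s actual method, which Netflix has not published in detail. The member names, the ratings, and the rule that 4 or 5 stars counts as “liked” are all assumptions made up for illustration. For each ordered pair of films, it counts the members who rated both and liked the first, and then the share of those members who also liked the second.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical toy ratings: member -> {title: stars on a 1-5 scale}.
ratings = {
    "ann": {"The Station Agent": 5, "Mostly Martha": 4, "Gandhi": 3},
    "bob": {"The Station Agent": 4, "Mostly Martha": 5},
    "cat": {"The Station Agent": 5, "Mostly Martha": 2, "Gandhi": 5},
    "dan": {"The Station Agent": 2, "Gandhi": 4},
}

LIKED = 4  # assumption: 4 or 5 stars counts as "liked"


def conditional_like_table(ratings, liked=LIKED):
    """Estimate P(liked B | liked A) among members who rated both A and B.

    Returns {(A, B): (probability, support)}, where support is the number
    of members who rated both films and liked A.
    """
    liked_a = defaultdict(int)     # (A, B) -> members who rated both, liked A
    liked_both = defaultdict(int)  # (A, B) -> members who liked both A and B
    for member_ratings in ratings.values():
        for a, b in permutations(member_ratings, 2):
            if member_ratings[a] >= liked:
                liked_a[(a, b)] += 1
                if member_ratings[b] >= liked:
                    liked_both[(a, b)] += 1
    return {pair: (liked_both[pair] / n, n) for pair, n in liked_a.items()}


table = conditional_like_table(ratings)
prob, support = table[("The Station Agent", "Mostly Martha")]
print(f"P(like Mostly Martha | liked The Station Agent) = {prob:.2f} (n={support})")
```

The support count matters as much as the probability: an estimate built on three members deserves far less trust than one built on thirty thousand, which is one reason a confidence measure alongside each prediction would be useful.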

The point of collaborative filtering systems is to match people (especially online shoppers) who appear to be similar in their preferences. Amazon can do it by looking at wish-lists, browsing, and purchasing behavior. Cinematch does it a little more systematically by collecting ratings. It’s sort of like automated word-of-mouth, and as I noted in my previous post, this is crucial to driving a long-tail business based on experience goods.

Back to the Netflix Prize for a moment. Clive Thompson, in The New York Times Sunday Magazine (“If You Liked This, You’re Sure to Love That,” 23 November 2008), reports that a major breakthrough occurred very early in the competition. A team called “Simon Funk” jumped nearly 40% of the way toward the 10% improvement goal by using singular value decomposition. SVD is a data reduction method similar to (or one of) the factor-analytic methods commonly used in marketing research. In effect, SVD creates clusters of films based on the correlations in the ratings given to those films. Depending on the components of the clustering, SVD (like factor analysis) might reveal reasons for the clustering (if, for example, romantic comedies have strong positive loadings on one component, and action adventure films have strong negative loadings on the same component). Using something like SVD addresses some of the challenges presented by the Cinematch data. With SVD, the predictive algorithm can work with a small number of components rather than millions of individual comparisons yet keep most of the information contained in those comparisons. That makes it easier to “borrow” information to fill in the gaps in ratings data, which is particularly useful in cases where you have wide variation in the object sets rated by each Netflix member.
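A small sketch may help show what “loadings” and “borrowing information” mean here. This is not the method the Netflix Prize contestants used; it is plain truncated SVD from NumPy applied to a tiny invented ratings matrix, with missing ratings filled by each film’s mean before decomposition (a crude assumption made only so the matrix is complete). The film names and ratings are hypothetical.

```python
import numpy as np

# Hypothetical ratings: rows are members, columns are films; np.nan = not rated.
films = ["RomCom A", "RomCom B", "Action A", "Action B"]
R = np.array([
    [5.0, 4.0, 1.0, np.nan],
    [4.0, 5.0, np.nan, 1.0],
    [1.0, np.nan, 5.0, 4.0],
    [np.nan, 1.0, 4.0, 5.0],
    [5.0, 5.0, 2.0, 1.0],
])


def low_rank_predict(R, k=1):
    """Fill gaps with film means, then rebuild the matrix from k components."""
    film_means = np.nanmean(R, axis=0)
    filled = np.where(np.isnan(R), film_means, R)  # assumption: mean imputation
    centered = filled - film_means
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    # Keep only the k strongest components and add the film means back.
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :] + film_means
    return approx, Vt[:k, :]


predicted, loadings = low_rank_predict(R, k=1)
for film, loading in zip(films, loadings[0]):
    print(f"{film}: loading {loading:+.2f}")
print(f"Member 0, Action B (never rated): predicted {predicted[0, 3]:.2f}")
```

In this toy data the two romantic comedies load on the first component with one sign and the two action films with the opposite sign (the overall sign of a component is arbitrary). The first member never rated “Action B,” but because that member’s ratings look like those of a romantic-comedy fan, the predicted rating lands below the film’s average.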

What’s missing from Netflix right now–and maybe the winning solutions incorporate this–is a way to account for differences among individuals. When I contemplated using hierarchical Bayesian modeling for this exercise, I had in mind inferring relevant characteristics about each individual from patterns in their ratings. A Bayesian approach has several potential advantages, such as taking into account the limited (and potentially biased) set of observations that Cinematch collects from each member. Of course, Netflix could also collect additional information from members in the form of a survey and incorporate it (perhaps via segmentation analysis) to help make recommendations.
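One way to see how a hierarchical approach handles a limited set of observations is partial pooling. The sketch below is the simplest possible case, a normal-normal model with known variances, and is only meant to illustrate the idea; it is not something Netflix or the prize contestants are known to use. The function name and the variance values are assumptions chosen for illustration.

```python
def shrunken_mean(member_ratings, population_mean,
                  within_var=1.0, between_var=0.25):
    """Posterior mean of a member's average rating under a normal-normal model.

    within_var:  assumed variance of one member's ratings around their own mean
    between_var: assumed variance of members' means around the population mean
    """
    n = len(member_ratings)
    if n == 0:
        return population_mean  # no data: fall back on the population
    member_mean = sum(member_ratings) / n
    data_precision = n / within_var
    prior_precision = 1.0 / between_var
    return ((data_precision * member_mean + prior_precision * population_mean)
            / (data_precision + prior_precision))


population_mean = 3.6  # hypothetical average rating across all members
few = shrunken_mean([5, 5], population_mean)
many = shrunken_mean([5] * 40, population_mean)
print(f"2 ratings of 5 stars:  estimate {few:.2f}")
print(f"40 ratings of 5 stars: estimate {many:.2f}")
```

A member with two five-star ratings is pulled most of the way back toward the population average, while a member with forty is taken nearly at face value. A full hierarchical model would estimate the variances from the data and could pool across films or segments as well as members.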

For me, the ideal predictive model is one that makes the underlying behavioral model explicit, meaning that the predictive model incorporates variables that have a causal relationship to the outcome of interest. In addition, an ideal model will account for heterogeneity across consumers. The starting point in developing an ideal predictive model is mapping out the underlying behavioral model. Then you can hang data on the proposed model and start crunching.