We all know that given reasonable data, a good predictive modeler can build a model that works well and helps make better decisions than what is currently used in your organization (at least in our own minds). Newer data, sophisticated algorithms, and a seasoned analyst are all working in our favor when we build these models, and if success were measured by accuracy (as it is in most data mining competitions), we’d be in great shape. Yes, there are always gotchas and glitches along the way. But when my deliverable is only slideware, even if the modeling is hard, I’m confident of being able to declare victory at the end.
However, the reality is that there is much more to the transition from cool model to actual deployment than a nice slide deck and paper accepted at one’s favorite predictive analytics, data mining or big data conference. In these venues, the winning models are those that are “accurate” (more on that later) and have used creative analysis techniques to find the solution; we won’t submit a paper when we only had to press the “go” button and have the data mining software give us a great solution!
For me, the gold standard is deployment. If the model gets used and improves the decisions an organization makes, I’ve succeeded. Three ways to increase the likelihood your models are deployed are:
1) Make sure the model stakeholder designs deployment into the project from the beginning
The model stakeholder is the individual, usually a manager, who is the advocate of predictive models to decision-makers. It is possible that a senior-level modeler can do this task, but that person must be able to switch hit: he or she must be able to speak the language of management and be able to talk technical detail to analysts. This may require more than one trusted person: the manager, who is responsible and makes the ultimate decisions about the models, and the lead modeler, who is responsible for the technical aspects of the model. It is more than “talking the talk” and knowing buzz-words in both realms; the person or persons must truly be “one of” both groups.
If you have followed my blog posts and conference talks, you know I am a big advocate of the CRISP-DM process model (or equivalent methodologies, which seem to be endless). I’ve referred to CRISP-DM often, including on topics related to what data miners need to learn and Defining the Target Variable, just as two examples.
The stakeholder must not only understand the business objectives of the model (Business Understanding in CRISP-DM), but must also be present when discussions take place about which models will be built. It is essential that reasonable expectations are put into place from the beginning, including what a good model will “look like” (accuracy and interpretability) and how the final model will be deployed.
I’ve seen far too many projects die or become inconsequential because either the wrong objectives were used in building the models, meaning the models were operationally useless, or because the deployment of the models was not considered, meaning again that the models were operationally useless. As an example, on one project, the model was assumed to be able to be run within a rules engine, but the models that were built were not rules at all, but were complex non-linear models that could not be translated into rules. The problem obviously could have been avoided had this disconnect been verbalized early in the modeling process.
2) Make sure modelers understand the purpose of the models
The modelers must know how the models will be used and what metrics should be used to judge model performance. A good summary of typical error metrics used by modelers is found here. However, for most of the models I have deployed in customer acquisition, retention, and risk modeling, the treatment based on the model is never applied to the entire population (we don’t mail everyone, just a subset). So the metrics that make the most sense are often ones like “lift after the top decile”, maximum cumulative net revenue, top 1000 scores to be investigated, etc. I’ve actually seen negative correlations between the ranking of models based on global metrics (like classification error or R^2) and the ranking based on subset selection metrics, such as top 1000 scores; very different models may be deployed depending on the metric one uses to assess them. If modelers aren’t aware of the metric to be used, the wrong model can be selected, even one that does worse than the current approach.
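To make the point concrete, here is a minimal sketch in Python of how a global metric and a subset-selection metric can disagree. The data, the two sets of scores, and the function names (`top_n_hits`, `lift_at_depth`, `classification_error`) are all made up for illustration; they do not come from any real project or library.

```python
def top_n_hits(scores, labels, n):
    """Count the positives among the n highest-scoring records."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:n])


def lift_at_depth(scores, labels, depth=0.10):
    """Response rate in the top `depth` fraction divided by the overall rate."""
    n = max(1, int(round(len(scores) * depth)))
    overall_rate = sum(labels) / len(labels)
    top_rate = top_n_hits(scores, labels, n) / n
    return top_rate / overall_rate


def classification_error(scores, labels, threshold=0.5):
    """Fraction of records misclassified when scoring at a fixed threshold."""
    wrong = sum((s >= threshold) != bool(y) for s, y in zip(scores, labels))
    return wrong / len(labels)


# Hypothetical held-out data: 10 records, 2 responders (label 1).
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Model A classifies well overall but gives its single highest score
# to a non-responder.
model_a = [0.8, 0.7, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]

# Model B never crosses the 0.5 threshold, so it misclassifies both
# responders, yet it ranks them at the very top.
model_b = [0.45, 0.40, 0.30, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]

for name, scores in [("A", model_a), ("B", model_b)]:
    print(name,
          "error:", classification_error(scores, labels),
          "top-decile lift:", lift_at_depth(scores, labels))
```

In this toy case Model A wins on classification error while Model B wins on top-decile lift, so which model gets deployed depends entirely on which metric the modelers were told to use.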
Second, if the modelers don’t understand how the models will be deployed operationally, they may find a fantastic model, one that maximizes the right metric, but is useless. The Netflix Prize is a great example: the final winning model was accurate but far too complex to be used. Netflix extracted key pieces of the models to operationalize instead. I’ve had customers stipulate to me that “no more than 10 variables can be included in the final model”. If modelers aren’t aware of specific timelines or implementation constraints, a great but useless model can be the result.
3) Make sure the model stakeholder understands what the models can and can’t do
In the effort to get models deployed, I’ve seen models elevated to a status they don’t deserve, most often by exaggerating their accuracy and expected performance once in operation. I understand why modelers may do this: they have a direct stake in what they did. But the manager must be more skeptical and conservative.
One of the most successful colleagues I’ve ever worked with used to assess model performance on held-out data using the metric we had been given (maximum depth one could mail to and still achieve the pre-determined response rate). But then he always backed off what was reported to his managers by about 10% to give some wiggle room. Why? Because even with our best efforts, there is still a danger that the data environment after the model is deployed will differ from that used in building the models, thus reducing the effectiveness of the models.
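A rough sketch of that calculation in Python follows. It assumes one particular reading of “backed off by about 10%”, namely shrinking the measured depth by 10% before reporting it; the function names, the toy scores, and the 50% target response rate are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def max_mail_depth(scores, responses, target_rate):
    """Largest number of top-scored records whose cumulative response
    rate still meets target_rate. Returns 0 if no depth qualifies."""
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    best_depth = 0
    responders = 0
    for depth, (_, responded) in enumerate(ranked, start=1):
        responders += responded
        if responders / depth >= target_rate:
            best_depth = depth
    return best_depth


def conservative_depth(measured_depth, haircut=0.10):
    """Shrink the measured depth by `haircut` (an assumed interpretation
    of the 10% wiggle room) and round down to a whole record count."""
    return int(measured_depth * (1 - haircut))


# Hypothetical held-out records, already scored by the model.
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50]
responses = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

measured = max_mail_depth(scores, responses, target_rate=0.5)
reported = conservative_depth(measured)
print("measured depth:", measured, "reported depth:", reported)
```

With these made-up numbers the held-out data supports mailing 8 records deep, and the figure passed up the chain would be 7. The same haircut could just as reasonably be applied to the expected response rate instead of the depth.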
A second problem for the model stakeholder is communicating an interpretation of the models to decision-makers. I’ve had to do this exercise several times in the past few months and it is always eye-opening when I try to explain the patterns a model is finding when the model is itself complex. We can describe overall trends (“on average”, more of X increases the model score) and we can also describe specific patterns (when observable fields X and Y are both high, the model score is high). Both are needed to communicate what the models do, but both have to connect with what a decision-maker understands about the problem. If the explanation doesn’t make sense, the model won’t be used. If it is too obvious, the model isn’t worth being used.
The ideal model for me is one where the decision-maker nods knowingly at the “on average” effects (these should usually be obvious). Then, once you throw in some specific patterns, he or she should scrunch his or her eyes, think a bit, then smile as the implications of the pattern dawn: the pattern really does make sense, but was previously not considered.
As predictive modelers, we know that absolutes are hard to come by, so even if these three principles are adhered to, other factors can sabotage the deployment of a model. Nevertheless, in general, these steps will increase the likelihood that models are deployed. In all three steps, communication is the key to ensuring the model built addresses the right business objective, is judged by the right scoring metric, and can be deployed operationally.
NOTE: this post was originally posted for the Predictive Analytics Times at http://www.predictiveanalyticsworld.com/patimes/january13/