Predictive Analytics World Addresses Risk and Fraud Detection

Eric Siegel focused his plenary session on predicting and assessing risk in the enterprise, and in his usual humorous way, described how big, macro or catastrophic risk often dominates thinking, micro or transactional risk can cost organizations more than macro risk. The micro risk is where predictive analytics is well suited, what he called data-driven micro risk management.

The point is well-taken because the most commonly used PA techniques are work better with larger data than “one of a kind” events. Micro risk can be quantified in a PA framework well.

During the second day, an excellent talk described a fraud assessment application in the insurance industry. While the entire CRISP-DM process were covered in this talk (from Business Understanding through Deployment), there was one aspect that struck me in particular, namely the definition of the target variable to predict. Of course, the most natural target variable for fraud detection is a label indicating if a claim has been shown to be fraudulent. Fraud often has a legal aspect to it, where a claim can only be truly “fraud” after it has been prosecuted and the case closed. This has at least two difficulties for analytics. First, it can take quite some time for a case to close, making the data one has for building fraud models lag by perhaps years from when the fraud was perpetrated. Patterns of fraud change, and thus models may perpetually be behind in identifying the fraud patterns.

Second, a there are far fewer actual proven fraud cases compared to those that are suspicious and worthy of investigation. Cases may be dismissed or “flushed” for a variety of reasons ranging from lack of resources to investigate, statutory restrictions, and legal loopholes which do not reduce the risk for a particular claim at all, but rather just change the target variable (to 0), making these cases appear the same as benign cases.

In this case study, the author described a process where another label for risk was used, a human-generated label that only indicated a high-enough level of suspicious behavior rather than only using actual claims fraud, a good idea in my opinion.