Integrating data and text analysis

Josh Becker of SubZero Wolf and Dave Froning of SAS presented on integrating text analytics and data analytics to make an impression. Text analytics has a lot of potential, but success stories are not as widespread as you would think they should be, nor are there many stories of operationalizing text analytics. The challenge, Dave thinks, is that text patterns alone are not enough to give you actionable information to change behavior – in repair technicians, for instance. To be useful, text analytics must be integrated with both the operational business process and structured data analytics. I, of course, would say that you should focus on decision points within the process and use the analytics (data or text) that will help you with that decision. Dave is on the same page.

There are lots of places in the warranty chain where text and data can be used to make better decisions. For instance, 10-15% of warranty cost is fraudulent and you want to find fraudulent claims before payment is made. Just using rules for fraud is not enough because fixed patterns defined in rules can be learned by fraudsters and because it is hard to find patterns across claims. You can use text and data analytics…

Dave presented using analytics as a separate step, post rules-processing, but I think this is all part of the decision – to approve or not. In particular I think there will be rules that could be made more precise if they used the scoring models as an input and putting the analytics after the rules prevents this. He did acknowledge that the analytics can generate new rules and that this is a way to close the loop (which is, of course, true).
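
To make my point about rules and scores concrete, here is a minimal sketch of a rule that takes a model score as one of its inputs instead of running before the model. Everything in it is an assumption for illustration: the `Claim` fields, the thresholds and the `fraud_score` value are hypothetical and were not part of the presentation.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    # All fields are hypothetical, chosen only for this illustration.
    amount: float          # claimed amount in dollars
    days_since_sale: int   # age of the unit when the claim was filed
    fraud_score: float     # 0.0-1.0 output of some scoring model (assumed)


def decide(claim: Claim) -> str:
    """Return 'approve', 'review' or 'reject' for a warranty claim."""
    # A fixed rule on its own: claims past an assumed five-year
    # warranty window are rejected regardless of the score.
    if claim.days_since_sale > 5 * 365:
        return "reject"
    # A rule made more precise by the score: the dollar limit that
    # triggers manual review tightens as the fraud score rises.
    review_limit = 200.0 if claim.fraud_score >= 0.7 else 1000.0
    if claim.amount > review_limit:
        return "review"
    return "approve"
```

With the score available inside the rule, the same $500 claim can be approved when the score is low and sent to review when it is high, which a rules-then-analytics sequence cannot express.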

Josh talked about the use of text analytics in the next step in the warranty chain – classification and coding of claims to support better incident and root cause identification. SubZero Wolf used to have 250 failure codes which non-technical folks were supposed to assign based on the technicians’ notes (which are obscure, full of jargon etc). Not only was this labor-intensive, it was inaccurate and correcting this created a long time lag when trying to do root cause analysis.

They have six free-form text and semi free-form fields. The text has problems – jargon, abbreviations, misspellings and more – but is the best information available. They implemented text analysis and have given up on numeric failure coding. Three models:

Failure Part Model
Picks up failure parts from the text using matching, synonyms etc
Failure Mode Model
Finds the failure modes listed in the claim
Service Part Family Model
Cleans up part name and text, pretty simple model
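
As a rough idea of what "matching, synonyms etc" could look like for the Failure Part Model, here is a minimal sketch. It is not how SubZero Wolf or SAS built theirs: the synonym table, the part names and the `find_failure_parts` function are all hypothetical.

```python
import re

# Hypothetical synonym table: canonical part name -> variants seen in notes.
PART_SYNONYMS = {
    "compressor": ["compressor", "comp", "compresor"],
    "evaporator fan": ["evaporator fan", "evap fan"],
    "door gasket": ["door gasket", "gasket", "door seal"],
}


def find_failure_parts(note: str) -> list[str]:
    """Return canonical part names mentioned in a technician note."""
    text = note.lower()
    found = []
    for canonical, variants in PART_SYNONYMS.items():
        # Word boundaries stop "comp" from matching inside "complete".
        pattern = r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
        if re.search(pattern, text):
            found.append(canonical)
    return found
```

A real model would need a far larger dictionary and some handling of misspellings that nobody has listed yet, but the shape is the same: map messy text to a small set of clean part names.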

These then get fed into the analysis process. Their text models have proven to be very accurate, with a maximum of 2% falling into the catchall category where the models cannot detect anything. They have eliminated 65 days from their detect-to-correct cycle and were able to move 20 employees (1% of the total) from coding to more useful roles in the call center.

It is also possible to use text analytics to improve and refine an existing coding structure. The analytics can show that codes should be split, merged or where there is overlap – essentially using what people write to make the flags and codes more accurate.

He gave an example of a problem they had worked through in the past. They analyzed paid claims every few weeks and the particular example showed a problem that they detected 3.5 months earlier. This meant 5,500 fewer defective units, which would have meant 1,500 fewer failures in the five-year warranty period, a saving of $475,000. And they do this many times.

Either way you need to integrate this into an early warning system to automate issue detection. Detecting new words is also useful for detecting new failure modes. They run some reports that analyze new words being found – those not being recognized. These can be very helpful in detecting new problems especially when introducing new technologies into a product range.
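
A new-words report of the kind described here could be sketched as follows. This is only an illustration under my own assumptions: the `new_word_report` function, the `known_words` vocabulary and the `min_count` threshold are hypothetical, not the reports they actually run.

```python
import re
from collections import Counter


def new_word_report(notes, known_words, min_count=2):
    """List words that the existing vocabulary does not recognize.

    Returns (word, count) pairs, most frequent first, keeping only
    words seen at least `min_count` times to filter out one-off typos.
    """
    counts = Counter()
    for note in notes:
        for word in re.findall(r"[a-z]+", note.lower()):
            if word not in known_words:
                counts[word] += 1
    return [(w, c) for w, c in counts.most_common() if c >= min_count]
```

Run against each new batch of claims, a word that suddenly climbs this list is a candidate for a new failure mode or a new component that the models do not yet know about.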

To find the problems that are being detected you need to use data and text analytics as part of your problem definition process:

First you want to do some “fuzzy” search so you can find all the synonyms and misspellings etc.
It does not matter how this works; let the system worry about it. What matters is pulling back the claims that are relevant.
Secondly you want to be able to use clustering – patterns of comments.
This lets you subdivide into groups with different kinds of comments for the same kind of failure, for instance. Each cluster has quantitative variables too – structured data – so you can see what you know about the claims in a sector, e.g. which supplier or which categories.
Thirdly you want to be able to find similar claims.
Before you start to drill into root cause analysis you want to pull all the claims that seem relevant and read them. Text analytics can be used to rank other comments in terms of how similar they are to the one that prompted the investigation. This helps focus on the claims that are most likely to describe the same root cause.
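
The first and third steps above can be sketched with nothing but the Python standard library. This is a toy illustration under my own assumptions, not the SAS implementation: `fuzzy_search` uses `difflib` string similarity to catch misspellings, `rank_similar` uses cosine similarity over simple word counts, and clustering is left out because it really needs a proper analytics library.

```python
import difflib
import math
import re
from collections import Counter


def tokens(text):
    """Split a comment into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def fuzzy_search(claims, term, cutoff=0.8):
    """Return claims containing a word close to `term` (catches misspellings)."""
    hits = []
    for claim in claims:
        if difflib.get_close_matches(term, tokens(claim), n=1, cutoff=cutoff):
            hits.append(claim)
    return hits


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_similar(seed, claims):
    """Rank claims by similarity to the comment that prompted the investigation."""
    seed_vec = Counter(tokens(seed))
    scored = [(cosine(seed_vec, Counter(tokens(c))), c) for c in claims]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

The point of the ranking is the reading order: the analyst still reads the claims, but starts with the ones most likely to share a root cause with the seed comment.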

I liked the fact that the text analytics are being used to support both the transactional processes and the more investigative processes. Integration, as Dave says, is key – integration with data analytics and with rules and process.

That’s it for me. Hope you enjoyed the show.