Building Neural Networks on Unbalanced Data (using Clementine)

November 1, 2009
50 Views

I got a ton of ideas whilst attending the Teradata Partners conference and also Predictive Analytics World. I think my presentations went down well (well, I got good feedback). There were also a few questions and issues that were posed to me. One issue raised by Dean Abbott was regarding building neural networks on unbalanced data in Clementine.

Rightly so, Dean pointed out that the building of neurals nets can actually work perfectly fine against unbalanced data. The problem is that when the Neural Net determines a categorical outcome it must know the incidence (probability) of that outcome. By default, Clementine will simply take the output neuron values, and if the value is above 0.5 the prediction will be true, else if the output neuron value is below 0.5 the category outcome will be false. This is why in Clementine you need to balance categorical outcome to roughly 50%/50% when you build the neural net model. In the case of multiple categorical values it is the highest output neuron value which becomes the prediction.

But there is a simple solution!

It is something I have always done out of habit because it has proved to



I got a ton of ideas whilst attending the Teradata Partners conference and also Predictive Analytics World. I think my presentations went down well (well, I got good feedback). There were also a few questions and issues that were posed to me. One issue raised by Dean Abbott was regarding building neural networks on unbalanced data in Clementine.

Rightly so, Dean pointed out that the building of neurals nets can actually work perfectly fine against unbalanced data. The problem is that when the Neural Net determines a categorical outcome it must know the incidence (probability) of that outcome. By default, Clementine will simply take the output neuron values, and if the value is above 0.5 the prediction will be true, else if the output neuron value is below 0.5 the category outcome will be false. This is why in Clementine you need to balance categorical outcome to roughly 50%/50% when you build the neural net model. In the case of multiple categorical values it is the highest output neuron value which becomes the prediction.

But there is a simple solution!

It is something I have always done out of habit because it has proved to generate better models, and I find a decimal score more useful. Being a cautious individual (and at the time a bit jet lagged) I wanted to double check first, but simply by converting a categorical outcome into a numeric range you will avoid this problem.

In situations where you have a binary categorical outcome (say, churn yes/no or response yes/no) then in Clementine you can use a Derive (flag) node to create alternative outcome values. In a Derive (flag) node simply change the true outcome to 1.0 and the false outcome to 0.0. 

By changing the categorical outcome values to a decimal range outcome between 0.0 and 1.0, the Neural Network model will instead expose the output neuron values and the Clementine output score will be a decimal range from 0.0 to 1.0. The distribution of this score should also closely match the probability of the data input into the model during building. In my analysis I cannot use all the data because I have too many records, but I often build models on fairly unbalanced data and simply use the score sorted/ranked to determine which customers to contact first. I subsequently use the lift metric and the incidence of actual outcomes in sub-populations of predicted high scoring customers. I rarely try to create a categorical ‘true’ or ‘false’ outcome, so didn’t give it much thought until now.

If you want to create an incidence matrix that simply shows how many ‘true’ or false’ outcomes the model achieves, then instead of using the Neural Net score of 0.5 to determine the true or false outcome, you simply use the probability of the outcome used to build the model. For example, if I *build* my neural net using data balanced as 250,000 false outcomes and 10,000 true outcomes, then my cut-off neural network score should be 0.04. If my neural network score exceeds 0.04, then I predict true; else if my neural network score is below 0.04, then I predict false. A simple derive node can be used to do this.

If you have a categorical output with multiple values (say, 5 products or 7 spend bands), then you can use a Set-To-Flag node in a similar way to create many new fields, each with a value of either 0.0 or 1.0. Make *all* new set-to-flag fields outputs and the Neural Network will create a decimal score for each output field. This is essential exposing the raw output neuron values, which you can then use in many ways similar to above (or use all output scores in a rough ‘fuzzy’ logic way as I have in the past:).

I posted a small example stream on the kdkeys Clementine forum http://www.kdkeys.net/forums/70/ShowForum.aspx
http://www.kdkeys.net/forums/thread/9347.aspx

Just change the file suffix from .zip to .str and open the Clementine stream file. Created using version 12.0, but should work in some older versions.
http://www.kdkeys.net/forums/9347/PostAttachment.aspx

I hope this makes sense. Free feel to post a comment if elboration is needed!

 – enjoy!

Link to original post