# Why normalization matters with K-Means

A question about K-means clustering in Clementine was posted here. I thought I knew the answer, but took the opportunity to prove it to myself.

I took the KDD-Cup 98 data and looked at just four fields: Age, NumChild, TARGET_D (the amount the recaptured lapsed donors gave) and LASTGIFT. I took only four to keep the problem simple, and chose variables with relatively large differences in mean values (where normalization might matter). The two monetary variables have a further problem: both are severely positively skewed.
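To see why differences in scale might matter, here is a minimal sketch of how Euclidean distance, which K-means relies on, behaves on unscaled fields. The three records are hypothetical and are not taken from the KDD-Cup 98 data; only the relative sizes of the numbers matter.

```python
import math

# Hypothetical records, NOT taken from the KDD-Cup 98 data:
# (age in years, number of children)
a = (30.0, 1.0)
b = (60.0, 1.0)  # same number of children as a, 30 years older
c = (30.0, 4.0)  # same age as a, three more children

d_ab = math.dist(a, b)  # driven entirely by the age gap
d_ac = math.dist(a, c)  # driven entirely by the children gap

print(f"distance a-b (age differs):     {d_ab}")
print(f"distance a-c (children differ): {d_ac}")
```

On raw values the 30-year age gap counts ten times as much as the three-child gap, even though three children is a large difference for that field. A field with a wide numeric range can therefore dominate the cluster assignments.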

The following image shows the results of two clustering runs with the Clementine K-Means algorithm: the first on raw data, the second on normalized data. The normalization consisted of log transforms (for TARGET_D and LASTGIFT) followed by z-scores for all four fields (the two log-transformed fields, AGE and NUMCHILD). I used the default of 5 clusters.
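The normalization described above can be sketched in plain Python. This is an illustration, not the Clementine stream: the helper names `zscore` and `log_then_zscore` are made up here, the gift amounts are hypothetical, and the choices of natural log and population standard deviation are assumptions, since the post does not say which were used.

```python
import math
import statistics

def zscore(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("cannot z-score a constant field")
    return [(v - mean) / sd for v in values]

def log_then_zscore(values):
    """Log transform to reduce positive skew, then z-score.

    Assumes every value is strictly positive; zeros would need
    something like math.log1p instead.
    """
    return zscore([math.log(v) for v in values])

# Hypothetical gift amounts with one large value, to mimic positive skew
gifts = [5.0, 10.0, 20.0, 200.0]

z_raw = zscore(gifts)           # z-score only: the outlier still stands far apart
z_gifts = log_then_zscore(gifts)  # log first: the outlier is pulled in

print([round(z, 2) for z in z_raw])
print([round(z, 2) for z in z_gifts])
```

Fields that are not skewed, such as AGE and NUMCHILD in the post, would go through `zscore` alone.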

Here are the results in tabular form. Note that I’m reporting unnormalized values for the “normalized” clusters even though the actual clusters were formed by the normalized values. This is purely for comparative purposes. Note that:
1) the results are different, as measured by counts in each cluster
2) the unnormali
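The reporting convention mentioned above, clusters formed on normalized values but summarized in original units, could be sketched as follows. The function name `summarize_clusters`, the records, and the labels are all hypothetical; the labels stand in for whatever cluster assignments the K-means run on normalized data produced.

```python
from collections import defaultdict
from statistics import fmean

def summarize_clusters(raw_rows, labels):
    """Return {cluster label: (count, [mean of each raw field])}."""
    groups = defaultdict(list)
    for row, label in zip(raw_rows, labels):
        groups[label].append(row)
    return {
        label: (len(rows), [fmean(column) for column in zip(*rows)])
        for label, rows in groups.items()
    }

# Hypothetical raw records (AGE, LASTGIFT), in original units
raw_rows = [(30.0, 5.0), (40.0, 15.0), (70.0, 100.0)]
# Hypothetical labels, assumed to come from clustering the normalized data
labels = [0, 0, 1]

summary = summarize_clusters(raw_rows, labels)
for label, (count, means) in sorted(summary.items()):
    print(label, count, means)
```

Because the means are computed from the raw rows, they can be compared directly with the means from the run on raw data.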
