By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
SmartData CollectiveSmartData Collective
  • Analytics
    AnalyticsShow More
    data Analytics instagram stories
    Data Analytics Helps Marketers Make the Most of Instagram Stories
    15 Min Read
    analyst,women,looking,at,kpi,data,on,computer,screen
    What to Know Before Recruiting an Analyst to Handle Company Data
    6 Min Read
    AI analytics
    AI-Based Analytics Are Changing the Future of Credit Cards
    6 Min Read
    data overload showing data analytics
    How Does Next-Gen SIEM Prevent Data Overload For Security Analysts?
    8 Min Read
    hire a marketing agency with a background in data analytics
    5 Reasons to Hire a Marketing Agency that Knows Data Analytics
    7 Min Read
  • Big Data
  • BI
  • Exclusive
  • IT
  • Marketing
  • Software
Search
© 2008-23 SmartData Collective. All Rights Reserved.
Reading: Why normalization matters with K-Means
Share
Notification Show More
Aa
SmartData CollectiveSmartData Collective
Aa
Search
  • About
  • Help
  • Privacy
Follow US
© 2008-23 SmartData Collective. All Rights Reserved.
SmartData Collective > Big Data > Data Mining > Why normalization matters with K-Means
Data MiningPredictive Analytics

Why normalization matters with K-Means

DeanAbbott
Last updated: 2009/04/03 at 6:13 AM
DeanAbbott
4 Min Read
SHARE

A question about K-means clustering in Clementine was posted here. I thought I knew the answer, but took the opportunity to prove it to myself.

I took the KDD-Cup 98 data and just looked at four fields: Age, NumChild, TARGET_D (the amount the recaptured lapsed donors gave) and LASTGIFT. I took only four to make the problem simpler, and chose variables that had relatively large differences in mean values (where normalization might matter). Also, another problem with the two monetary variables is that they are both skewed positively (severely so).

The following image shows the results of two clustering runs: the first with raw data, the second with normalized data using the Clementine K-Means algorithm. The normalization consisted of log transforms (for TARGET_D and LASTGIFT) and z-scores for all (the log transformed fields, AGE and NUMCHILD). I used the default of 5 clusters.

Here are the results in tabular form. Note that I’m reporting unnormalized values for the “normalized” clusters even though the actual clusters were formed by the normalized values. This is purely for comparative purposes.

More Read

predictive analytics for amazon pricing

Using Predictive Analytics to Get the Best Deals on Amazon

Predictive Analytics Helps New Dropshipping Businesses Thrive
Data Mining Technology Helps Online Brands Optimize Their Branding
Promising Benefits of Predictive Analytics in Asset Management
Can Data Mining Aid with Off-Page SEO Strategies?

Note that:
1) the results are different, as measure by counts in each cluster
2) the unnormali…

A question about K-means clustering in Clementine was posted here. I thought I knew the answer, but took the opportunity to prove it to myself.

I took the KDD-Cup 98 data and just looked at four fields: Age, NumChild, TARGET_D (the amount the recaptured lapsed donors gave) and LASTGIFT. I took only four to make the problem simpler, and chose variables that had relatively large differences in mean values (where normalization might matter). Also, another problem with the two monetary variables is that they are both skewed positively (severely so).

The following image shows the results of two clustering runs: the first with raw data, the second with normalized data using the Clementine K-Means algorithm. The normalization consisted of log transforms (for TARGET_D and LASTGIFT) and z-scores for all (the log transformed fields, AGE and NUMCHILD). I used the default of 5 clusters.

Here are the results in tabular form. Note that I’m reporting unnormalized values for the “normalized” clusters even though the actual clusters were formed by the normalized values. This is purely for comparative purposes.

Note that:
1) the results are different, as measure by counts in each cluster
2) the unnormalized clusters are dominated by TARGET_D and LASTGIFT–one cluster contains the large values and the remaining have little variance.
3) AGE and NUMCHILD have some similar breakouts (40s with more children and 40s with fewer children for example).

So, the conclusion is (to answer the original question) K-Means in Clementine does not normalize the data. Since Euclidean distance is used, the clusters will be influenced strongly by the magnitudes of the variables, especially by outliers. Normalizing removes this bias. However, whether or not one desires this removal of bias depends on what one wants to find: sometimes if one would want a variable to influence the clusters more, one could manipulate the clusters precisely in this way, by increasing the relative magnitude of these fields.

One last issue that I didn’t explore here, is the effects of correlated variables (LASTGIFT and TARGET_D to some degree here). It seems to me that correlated variables will artificially bias the clusters toward natural groupings of those variables, though I have never proved the extent of this bias in a controlled way (maybe someone can point to a paper that shows this clearly).

DeanAbbott April 3, 2009
Share This Article
Facebook Twitter Pinterest LinkedIn
Share

Follow us on Facebook

Latest News

ai low code frameworks
AI Can Help Accelerate Development with Low-Code Frameworks
Artificial Intelligence
data Analytics instagram stories
Data Analytics Helps Marketers Make the Most of Instagram Stories
Analytics
data breaches
How Hospital Security Breaches Devastate Local Communities
Policy and Governance
analyst,women,looking,at,kpi,data,on,computer,screen
What to Know Before Recruiting an Analyst to Handle Company Data
Analytics

Stay Connected

1.2k Followers Like
33.7k Followers Follow
222 Followers Pin

You Might also Like

predictive analytics for amazon pricing
Predictive Analytics

Using Predictive Analytics to Get the Best Deals on Amazon

8 Min Read
predictive analytics in dropshipping
Predictive Analytics

Predictive Analytics Helps New Dropshipping Businesses Thrive

12 Min Read
data mining
Data Mining

Data Mining Technology Helps Online Brands Optimize Their Branding

7 Min Read
analyst,women,looking,at,kpi,data,on,computer,screen
Predictive Analytics

Promising Benefits of Predictive Analytics in Asset Management

11 Min Read

SmartData Collective is one of the largest & trusted community covering technical content about Big Data, BI, Cloud, Analytics, Artificial Intelligence, IoT & more.

AI and chatbots
Chatbots and SEO: How Can Chatbots Improve Your SEO Ranking?
Artificial Intelligence Chatbots Exclusive
ai in ecommerce
Artificial Intelligence for eCommerce: A Closer Look
Artificial Intelligence

Quick Link

  • About
  • Contact
  • Privacy
Follow US
© 2008-23 SmartData Collective. All Rights Reserved.
Go to mobile version
Welcome Back!

Sign in to your account

Lost your password?