A simple Data Transformation example…

December 2, 2008

In my experience of customer-focused data mining projects, over 80% of the time is spent preparing and transforming the customer data into a usable format. Often the data is transformed to a ‘single row per customer’ or similar summarised format, and many columns (aka variables or fields) are created to act as inputs into predictive or clustering models. Such data transformation can also be referred to as ETL (extract, transform, load), although my work is usually done in SQL within the data warehouse, so it is just the ‘T’ bit.

Granted, a lot of the ETL you perform will be data and industry specific, so I’ve tried to keep things very simple. I hope the example below, which transforms transactional data into a useful customer-centric format, is generic enough. Feedback and open discussion might broaden my habits.

Strangely, many ‘data mining’ books almost completely avoid the topic of data processing and data transformations. Those that do mention data processing often simply refer to feature selection algorithms, or to applying a log function to rescale numeric data for use as predictive algorithm inputs. Some mention the various types of means you could create (arithmetic, harmonic, winsorised, etc.), or measures of dispersion (range, variance, standard deviation, etc.).

There seems to be a glaring gap! I’m specifically referring to data processing steps that are separate from the mandatory or statistical requirements of the modelling algorithm. In my experience, relatively simple data processing steps can yield significantly better results than tweaking algorithm parameters. Some of these steps are likely to be industry or data specific, but I’m guessing many are widely useful. They don’t necessarily have to be statistical in nature.

So (to put my money where my mouth is) I’ve started by illustrating a very simple data transformation that I expect is common. On a public SPSS Clementine forum I’ve attached a small demo data file (which I created; it is entirely fictitious) and an SPSS Clementine stream file that processes it (only useful for users of SPSS Clementine).
Clementine Stream and text data files
my post to a Clementine user forum

I’m hoping that my peers might exchange similar ideas (hint!). A lot of this ETL stuff may be basic, but it’s rarely what data miners talk about, and it’s what I would find useful. This is just the start of a series of ETL steps you could perform.

I’ve also added a poll for feedback on whether this is helpful, too basic, etc.

– Tim

Example data processing steps

a) Creation of additional dummy columns
Where the data has a single category column that contains one of several values (in this example voice calls, SMS calls, data calls, etc.) we can use a CASE statement to create a new column for each category. We can use 0 or 1 as an indicator of whether the category value occurs in a given row, but you can also use the value of a numeric field (for example call count or duration, if the data is already partly summarised). A new column is created for each category value.

For example;

customer  category  score
bill      food      10
bill      drink     20
ben       food      15
bill      drink     25
ben       drink     20

Can be changed to;

customer  category  score  food_ind  drink_ind
bill      food      10     1         0
bill      drink     20     0         1
ben       food      15     1         0
bill      drink     25     0         1
ben       drink     20     0         1

Or even;

customer  category  score  food_score  drink_score
bill      food      10     10          0
bill      drink     20     0           20
ben       food      15     15          0
bill      drink     25     0           25
ben       drink     20     0           20
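
As a rough sketch of step a), the CASE approach might look something like the following. It runs the SQL through Python’s built-in sqlite3 module purely so that the example is self-contained; the table name `transactions` and the function name `make_dummy_columns` are made up for illustration and are not taken from the Clementine stream. Using `ELSE 0` for non-matching rows is an assumption too (leaving them NULL would also work).

```python
import sqlite3

# Toy rows from the food/drink example (customer, category, score).
ROWS = [
    ("bill", "food", 10),
    ("bill", "drink", 20),
    ("ben", "food", 15),
    ("bill", "drink", 25),
    ("ben", "drink", 20),
]

# One CASE expression per category value: an indicator column (0/1)
# and a score column that carries the numeric value across.
# `transactions` is a hypothetical table name.
DUMMY_SQL = """
SELECT customer,
       category,
       score,
       CASE WHEN category = 'food'  THEN 1     ELSE 0 END AS food_ind,
       CASE WHEN category = 'drink' THEN 1     ELSE 0 END AS drink_ind,
       CASE WHEN category = 'food'  THEN score ELSE 0 END AS food_score,
       CASE WHEN category = 'drink' THEN score ELSE 0 END AS drink_score
FROM transactions
ORDER BY rowid
"""


def make_dummy_columns(rows):
    """Load the rows into an in-memory table and return them with dummy columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions (customer TEXT, category TEXT, score INTEGER)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)
    result = conn.execute(DUMMY_SQL).fetchall()
    conn.close()
    return result


dummy_rows = make_dummy_columns(ROWS)
for row in dummy_rows:
    print(row)
```

With more than a handful of category values you would probably generate the CASE expressions rather than type them by hand, but the idea is the same.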

b) Summarisation
Aggregate the data so that we have only one row per customer (or whatever your ‘unique identifier’ is) and sum or average the dummy and/or raw columns.
So the output of the previous step could be summarised to something like this;

customer  food_score  drink_score
bill      10          45
ben       15          20
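
A minimal sketch of step b), again using SQLite via Python’s sqlite3 module so it can be run as-is. The dummy columns and the aggregation are combined into a single query by wrapping each CASE expression in SUM and grouping by customer. The table name `transactions`, the function name `summarise_by_customer` and the `_count` column names are assumptions for illustration only.

```python
import sqlite3

# Same toy rows as the food/drink example (customer, category, score).
ROWS = [
    ("bill", "food", 10),
    ("bill", "drink", 20),
    ("ben", "food", 15),
    ("bill", "drink", 25),
    ("ben", "drink", 20),
]

# Summing the 0/1 indicators gives a row count per category;
# summing the score columns gives a total per category.
# `transactions` is a hypothetical table name.
SUMMARY_SQL = """
SELECT customer,
       SUM(CASE WHEN category = 'food'  THEN 1     ELSE 0 END) AS food_count,
       SUM(CASE WHEN category = 'drink' THEN 1     ELSE 0 END) AS drink_count,
       SUM(CASE WHEN category = 'food'  THEN score ELSE 0 END) AS food_score,
       SUM(CASE WHEN category = 'drink' THEN score ELSE 0 END) AS drink_score
FROM transactions
GROUP BY customer
ORDER BY customer
"""


def summarise_by_customer(rows):
    """Return one summary row per customer from the transactional rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions (customer TEXT, category TEXT, score INTEGER)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)
    result = conn.execute(SUMMARY_SQL).fetchall()
    conn.close()
    return result


summary = summarise_by_customer(ROWS)
for row in summary:
    print(row)
```

Swapping SUM for AVG (or adding both) follows the same pattern; which aggregate is useful will depend on your data.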
