Data Darwinism: Market Driven Data Quality

February 20, 2011

Just trying a little contrarian thought this week …

Have you ever noticed how much time and energy goes into data validation?

I think it stems from visual forms development and the wide variety of clever data entry controls available – everyone wants to write an app that gets the oooo, cool! vote of approval. But how much of that energy spills over from value-add into feature creep?

Regex complexity at its finest …

When your IT peers are showing off their internally developed tools, or when internal departments pour creativity into their departmental data collection apps, try stepping back for a moment and looking at the amount of development, documentation, training, and maintenance work that gets generated. These amazing, subtle, and visually compelling methods for gathering and validating data can harden into complex validation rules that try to guarantee that only pristine data ever enters the system.

Is all of this really necessary? Is there real value-add to this approach? Oftentimes the coding of validation rules is so complex that the code becomes fragile and burdensome for future maintenance programmers. Another common problem – many specialized, departmental, and/or narrowly vertical applications have broad ranges of acceptable data, so the rules for permissible values need to be wildly flexible and adaptive.
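To make the fragility concrete, here is a hypothetical sketch (the field and pattern are my own invention, not from any real app) of the kind of "clever" input validator described above – a regex that tries to enumerate every acceptable phone-number format and still rejects perfectly good data:

```python
import re

# A hypothetical over-eager validator: the pattern anticipates one national
# format and quietly rejects everything else (international numbers,
# extensions, new formats), forcing a code change for each exception.
PHONE_RE = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")

def validate_phone(value: str) -> bool:
    """Return True only if the value matches the narrow pattern."""
    return bool(PHONE_RE.match(value))

print(validate_phone("(555) 123-4567"))    # True  -- the one anticipated format
print(validate_phone("+44 20 7946 0958"))  # False -- a valid number, rejected anyway
```

Every rejected-but-legitimate value becomes a maintenance ticket against the regex – exactly the burden the paragraph above describes.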

But how about NOT validating the input? Why not let “market forces” take over?

I am talking about instances where people are trying to get data into a System That Makes Some Problem Visible – for example, a database of projects or technical resource requests that have to be prioritized, or financial data that has to successfully post into a centralized data collection / aggregation system.

It might be easier to just document the requirements for the data, and then let the best quality data survive …

For your Project / Resource Prioritization application, a project will not get added to the prioritization list until all the data is complete and correct. Even if it is complete, it helps to make the project description easy to understand, compelling, and business relevant – or else someone else will get the resources.

Your monthly data submission has to conform to these [data structure] rules. If it does not conform, it will be kicked out / flagged with errors. You are responsible for getting your data cleaned up and compliant with the specification, and your data submitted by [the deadline] – else your submission will be late.

Now, this does put pressure on us to document the data formats and requirements clearly – but this is probably faster and easier than creating a gallery of automated rule checkers to validate input. And, when the document is proven to be complete, correct, and sufficient (i.e. not too complex), it would make a pretty good spec for an automated data validation program.
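As a rough illustration of that last point, a written data-format document can translate almost mechanically into a checker that flags problems rather than blocking entry. The field names and rules below are hypothetical stand-ins for whatever the documented spec would actually say:

```python
# A minimal sketch of "document the rules, flag the violations."
# SPEC mirrors the written data-format document; the fields are invented
# examples for a project-prioritization submission.
SPEC = {
    "project_name": {"required": True,  "type": str},
    "sponsor":      {"required": True,  "type": str},
    "budget":       {"required": False, "type": float},
}

def check_submission(record: dict) -> list[str]:
    """Return a list of problems instead of rejecting the record outright --
    the submitter owns cleanup, and a non-empty list means 'kicked out'."""
    problems = []
    for field, rule in SPEC.items():
        if field not in record:
            if rule["required"]:
                problems.append(f"missing required field: {field}")
        elif not isinstance(record[field], rule["type"]):
            problems.append(f"{field}: expected {rule['type'].__name__}")
    return problems

print(check_submission({"project_name": "ERP upgrade", "sponsor": "Finance"}))
# []  -- clean data survives
print(check_submission({"budget": "a lot"}))
# ['missing required field: project_name', 'missing required field: sponsor',
#  'budget: expected float']
```

The point is that the spec document and the checker share one source of truth, so the "gallery of automated rule checkers" collapses into a table of documented rules.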

Just a wacky idea – as system designers, we don’t have to control the world. Try making market forces work in your favor, just like content struggling for readership on the internet or new products looking for sales …

… may the cleanest data win!