BI’s Dirty Secrets – The Unfortunate Domination of Manually-Coded Extracts

Manually-coded extracts are another dirty secret of the BI world. I’ve been seeing them for years, in both large and small companies. They grow haphazardly and are never documented, which practically guarantees that they will become an IT nightmare.

How have manually-coded extracts become so prevalent? It’s not as if there aren’t enough data integration tools around, including ETL tools. Even large enterprises that use the correct tools to load their enterprise data warehouses will often resort to manually-coded extracts to load their downstream BI data sources such as data marts, OLAP cubes, reporting databases and spreadsheets.

After seeing this problem in enough client companies, I’ve got a few theories as to why it happens:

Money: The top tier tools are expensive. They are out of reach for SMBs and can even be too expensive for large enterprises to expand their use from the EDW to BI data source. There are data integration tools that would do a great job spanning price ranges, but for the most part nobody knows about them. And when they are used, they are misused (see below), so their reputation for producing a solid business ROI is diminished.
Stretched resources: In large enterprises, the centralized data warehouse team likely has data integration experience, but their backlog of work means that people creating BI data sources are on their own. So they end up hand-coding. In SMB firms, the IT staffs are too small to dedicate anyone to data integration, so no one is an expert.
Data never sleeps: Regardless of the state of data integration expertise and investment at an enterprise, business people still have to run and manage the business. This requires data. If the data has not been integrated for them, they’ll figure out some other way to get it — even if it means cutting and pasting data from spreadsheet queries or getting IT to “crank out” SQL scripts. This is why data shadow systems or spreadmarts get started and then become so prevalent.
You don’t know what you don’t know: Even when enterprises use data integration or ETL tools, they often don’t use them well. The biggest reason why people misuse these tools is that they don’t have a firm grasp of the concepts of data integration processes and advanced dimensional modeling. Tools are easy; concepts are harder. Anyone can start coding; it’s a lot harder to actually architect and design. Tool vendors don’t help this situation when they promote tools that “solve world hunger” and limit training to the tool, not any concepts.

So, here’s what happens: instead of using data integration best practices, people design the ETL tool processes the same way they would create a sequential series of SQL scripts to integrate data. In fact, many an ETL process simply executes stored procedures (SP) or SQL scripts. Why use the tool at all if you’re not going to use its, capabilities? When this happens, IT figures it was a waste of time to use the ETL tool to begin with, and the ETL tool investment had no ROI. This becomes a self-reinforcing loop enabling IT to justify (or rationalize) manual coding.

Coding is easier than thinking: There is an inherit bias for the IT staff to generate SQL code. They know it (just like the business person knows spreadsheets), they can crank something out quickly and it does not cost anything extra. The typical scenario is that the IT person creates a SQL script or a stored procedure to pull data from one source and things are fine. But then several hundred SQL scripts or stored procedures later, the hodgepodge and undocumented accumulation of pseudo ETL processes becomes the recurring method to load the data warehouse or BI data sources. Each change to that set of code takes longer and longer. It consumes more and more resource time just to maintain it. When new data needs to be integrated, another IT person starts the next hodgepodge of undocumented code with yet another simple SQL script.

How do we get out of this mess? Stay tuned for a future blog post.