Interview: The Need for Big Data Governance

First, what’s your background?

I have worked in the BI and EIM world for about 22 years. There’s lots of talk these days about the role of data scientists – I actually started off my career as a scientist using data. Before I joined the IT industry, I created and ran models of the effects of acid rain on rivers and lakes. It’s not unlike business simulations – you model the processes using real-world, historical data and then do “what if” analysis and try to predict the future.

I was attracted by the bright lights of the IT industry, and worked for SAS for a while, then joined Business Objects in 1999 to run product marketing in the UK. I had many different marketing roles over the years, worked in European marketing after the acquisition by SAP, and finished off as head of business development in the UK.

In May last year I joined ENTOTA, the leading SAP Data Services consultancy in EMEA. The company focuses on services for data migration and governance projects, helping implement data warehouses, etc. – basically, anything to do with data, the plumbing that the users often don’t see.

One of our specialties is that we can advise SAP customers on best-practice Data Services implementation. Like any application, it has to fit into an existing IT landscape, and you have to have an implementation framework in place – for example, a lot of ETL coding is often outsourced, so you need standards on how to write code, debug code, etc. so that developers can work together. It sounds quite a niche area, but it’s very important for the success of EIM projects.

In December, ENTOTA was acquired by BackOffice Associates, a US-based world leader in data migration and information governance solutions. The company has a broad software portfolio including a data stewardship platform that helps organizations manage the whole process of information governance. ENTOTA was a good fit: BackOffice Associates already support and OEM SAP Data Services, and ENTOTA’s strong European presence was a plus. The company also has a lot to bring to ENTOTA’s customers: over 17 years of experience of working with data, a huge amount of content on data governance and best practice, and hundreds of data experts.

What’s your presentation about?

The theme of the SAP Forum is obviously innovation, including topics like big data, mobile, and cloud. The danger with shiny new “Big Data” projects is that we get sucked into the hype and throw out the disciplines that we’ve learned over the years. Big data is clearly going to create some interesting data governance problems. I’m not a big fan of the term “big data,” but if we take the familiar 3Vs definition [volume, velocity, variety], this new data is going to put lots of stress on existing data governance processes – assuming there are any, because data governance is already a challenge for most organizations. In general, the move to big data will exacerbate existing problems.

Do you think people underestimate the need for data governance in big data scenarios?

Yes. Whether it’s your SAP ERP system or a big data database, you have to think about how bad data can and will get into your system. Whatever the underlying technologies, the principal causes of poor data quality are going to be the same. The temptation is to throw out the lessons learned over the years during the dash to big data. These projects are often driven by a need for better analytics, and if there’s one thing that’s going to expose bad data, it’s a BI system.

We’ve seen an increasing interest in real-time analytics, and the need for lean processes to get the data immediately into the hands of users who can do something with it. However, it doesn’t matter whether you’re talking about gigabytes or petabytes — bad data is bad data.

There’s no silver bullet, no shortcut to fantastic business insight just because you’re using a new technology. It may sound boring, but you have to pay attention to data quality, and cut bad data off at the source using a “data quality firewall” approach to your big data repository.

So how do you improve data quality?

There are three main ways bad data gets into systems, and they’re all essentially technology-agnostic.

The first is during data migration. Before you go live on a new system, you will normally bulk load some information. If your initial data load contains poor quality data, it can be really expensive to fix. If you’re talking about an ERP system, it can break essential business processes like being able to bill customers. A big data project could lose credibility with the users if they see a lot of data issues. It’s simpler and cheaper to prevent bad data getting in in the first place.

The second source of bad data is the day-to-day data movement once the system is up and live. No application is an island. In a big data project, the feeds could be from sensors, ERP systems, SAP HANA, Hadoop, etc. Taking the firewall approach, those real-time and batch interfaces have to apply data governance rules consistently.

The third source is human beings going about their daily work and entering bad data – missing or incorrect values, for example. Best practice says the closer to the source you apply the business rule, the cheaper the problem is to fix.

The right approach to all these problems is to have a data quality “firewall” that filters data rather like internet traffic. And you can’t create that firewall unless you first have a definition of what “good data” looks like. IT will have technical definitions of good data – no characters in a number field, for example. But ultimately, only the business knows what defines business-ready data, therefore IT has to collaborate with them to create the business rules. And those business rules need to be in place before any new big data project, not after.

The real trick is to apply those business rules everywhere, not just on a project-by-project basis. The worst approach is to create the business rules and store them on a server somewhere, but fail to use them. It’s like the old notion of “actionable information” in BI: there’s no point in having a nice dashboard if you don’t do anything with the information.

The second problem is that even when the business rules have been defined, they are applied in a piecemeal way across the business. It’s like “Chinese whispers” – as the rules spread, they change, and get implemented differently in multiple projects and regions. The result is separate islands of data governance, and then you still have a data quality problem by definition. Good data governance requires a hub approach, with a central place for rules that are applied consistently across all systems.
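As a minimal sketch of the firewall and hub ideas described above (not a description of any SAP or BackOffice Associates product), the Python below keeps one central rule set and applies it unchanged to every feed. The rule names, the record fields (`customer_id`, `amount`, `country`), and the `firewall` function are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Rule:
    """One business rule: a readable name plus a check that returns True for good data."""
    name: str
    check: Callable[[dict], bool]


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# Hypothetical central rule set (the "hub"). In practice the business,
# not IT alone, would define what business-ready data looks like.
CENTRAL_RULES = [
    Rule("customer_id is present", lambda r: bool(r.get("customer_id"))),
    Rule("amount is numeric", lambda r: _is_number(r.get("amount"))),
    Rule("country is a 2-letter code",
         lambda r: isinstance(r.get("country"), str) and len(r["country"]) == 2),
]


def firewall(records: Iterable[dict], rules=CENTRAL_RULES):
    """Split records into accepted ones and rejected ones with the rules they failed."""
    accepted, rejected = [], []
    for record in records:
        failures = [rule.name for rule in rules if not rule.check(record)]
        if failures:
            rejected.append((record, failures))
        else:
            accepted.append(record)
    return accepted, rejected


# The same rules guard a bulk migration load and a day-to-day feed alike.
migration_load = [{"customer_id": "C1", "amount": 10.5, "country": "GB"}]
daily_feed = [{"customer_id": "", "amount": "12", "country": "GBR"}]

ok_migration, bad_migration = firewall(migration_load)
ok_daily, bad_daily = firewall(daily_feed)
```

Because every interface calls the same `CENTRAL_RULES`, a rule changed in one place changes everywhere, which is the point of the hub approach; rejected records carry the names of the rules they failed so they can be fixed at the source.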

So far, we’ve been talking about traditional structured data. What about newer, less-structured data such as social media feeds – doesn’t that complicate the notion of “data quality”?

That’s the question we will increasingly have to answer in the next few years: how can we apply the same data governance rules that are well-understood in the structured world to unstructured/semi-structured data? One common challenge is that it’s often a completely different group of people that is in charge of the unstructured data. The priority must be to get everybody talking together — invite the new social-media savvy folks and the traditional structured data experts to have a discussion about the common business rules. To get the most out of data, we have to think holistically about information governance across the organisation.

What about other aspects of data governance?

Data quality is just one part of governance. Another is compliance and security. Today, people may have been sold a vision that big data is a “big bucket” that they can analyze in the hope of finding some great nugget of business insight. But that data still has to have a series of compliance rules around it: who can access it, how long you can keep it, what the archive strategy is, etc.

Matt Aslett of 451 Research wrote a great blog post on how the economics of data are changing, making it easier than ever to store large volumes of data. We still have to apply the concepts of information lifecycle management to these large data sets, which are really just another set of business rules.
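To illustrate how lifecycle policies can be treated as just another set of business rules, here is a minimal Python sketch. The data types, the retention periods, and the `lifecycle_action` function are assumptions made up for this example; real retention periods depend on the organization and its regulatory obligations.

```python
from datetime import date

# Hypothetical retention periods in days, keyed by data type.
RETENTION_DAYS = {
    "sensor_reading": 90,
    "invoice": 365 * 7,
}


def lifecycle_action(data_type: str, created: date, today: date) -> str:
    """Return 'keep', 'archive', or 'review' for a record of the given type and age."""
    limit = RETENTION_DAYS.get(data_type)
    if limit is None:
        # No rule defined yet: flag for review rather than guess.
        return "review"
    age_in_days = (today - created).days
    return "archive" if age_in_days > limit else "keep"


today = date(2014, 3, 11)
recent = lifecycle_action("sensor_reading", date(2014, 1, 1), today)
old = lifecycle_action("sensor_reading", date(2013, 1, 1), today)
unknown = lifecycle_action("social_post", date(2014, 1, 1), today)
```

Data types with no rule are flagged for review instead of being silently kept, so gaps in the policy become visible.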

Any final thoughts on why people should attend your presentation?

We’re at the start of an interesting journey exploring the value of big data. However, big data projects will fail if you don’t address data governance. It’s the same as it always has been – just harder!

Thanks for your time!

Presentation Information: March 11, 2014, 11:15am. Richard Neale, Head of Product Marketing for ENTOTA (now part of BackOffice Associates), Gold Sponsor.

Are you ready for big data governance?
As interest and investment in big data and the internet of things grow, we are about to see an explosion in the types and volume of data that organizations store and analyze as we set about extracting value from these data assets. Given that data quality and data governance are already a challenge for most organizations due to a largely ad-hoc approach, this challenge is set to increase. It is imperative, therefore, that organizations get a handle on data quality and data governance now, before they are engulfed in a rising tide of data. In this session, we will explore how to resolve the three root causes of poor data quality and how to get started with data governance, and examine why you can’t fix data quality without standards or without collaborating with the business.