Let’s Get Our Facts Straight About Big Data Privacy




Which of the following statements would you say are true about (big) data privacy?

  • “I’ve got nothing to hide.”
  • Privacy policies apply to data, not users.
  • A single privacy policy will suffice for all your data.
  • Anonymized data keeps my personal identity private.
  • Privacy is dead.

The answer: None of them are true. Seriously.



About the data ecosystem

Before I set the record straight on the above statements, let’s take a brief look at the world of data through four layers: users, data, applications, and platforms. Understanding these layers will be useful as we talk about privacy policies in the next section.

Users. It’s a common best practice to segment users by audience, or more specifically, by how they’re going to use the data. Once these categories are identified, you can then establish privacy policies for each one.

Here are some sample user categories: standard report users, ad hoc query users, remote partners/suppliers, power users, business executives, statisticians/data scientists, the board of directors, and HR.

In your own company, you will want to create user segments that fit your structure and culture.

Data. Now let’s look at the data layer. Data classification is a great exercise because it begins to force the conversation that there are different “types” of data and that data is not “one size fits all.” This type of exercise also makes it easier to assign ownership to data. Sample types of data include: transactional, master, metadata, controlled, reference, reporting, analytical, open, cross-functional, historical, department-specific, and process-specific.  

Another way to classify data is by first-, second-, and third-party:

  • First-party data is the data that a company has collected directly from the customer, such as website clicks, CRM data, order transactions, social data, cross-platform, and mobile data. This is the best type of data to have. The company has permission to have it.
  • Second-party data is some other company’s first-party data. Through a partnership or agreement, two companies share customer data for a cross-promotional campaign, for example. Or a site or application allows you to log in with a social profile, like Google or Facebook, instead of making you set up yet another username and password for yet another site.
  • Third-party data is the collected, aggregated, and anonymized data that is typically sold by data brokers. This data is widely available, including to your competitors. This is the data that consumers have little to no control over and, most likely, have not given explicit permission for you to use.

Still another way to classify data is based on its security. This is a great method of collaboration, especially for government, financial services, and health care providers for whom data is tightly regulated and privacy legislation is a reality. In this classification method, each data category has its own set of specific policies, which further drive not only who is allowed to use the data, but who is allowed to modify and administer it.
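To make the security-based approach concrete, here is a minimal sketch in Python of what a per-category policy table might look like. The security levels, role names, and the `is_allowed` helper are all hypothetical and only illustrate the idea that each category carries its own rules for who may use, modify, and administer the data.

```python
from enum import Enum


class SecurityLevel(Enum):
    """Hypothetical security categories; a real scheme will differ."""
    PUBLIC = "public"
    INTERNAL = "internal"
    RESTRICTED = "restricted"


# Hypothetical policy table: for each security level, which roles may
# use, modify, or administer data classified at that level.
POLICIES = {
    SecurityLevel.PUBLIC: {
        "use": {"analyst", "data_steward", "dba"},
        "modify": {"data_steward", "dba"},
        "administer": {"dba"},
    },
    SecurityLevel.INTERNAL: {
        "use": {"analyst", "data_steward", "dba"},
        "modify": {"data_steward"},
        "administer": {"dba"},
    },
    SecurityLevel.RESTRICTED: {
        "use": {"data_steward"},
        "modify": {"data_steward"},
        "administer": {"dba"},
    },
}


def is_allowed(role: str, action: str, level: SecurityLevel) -> bool:
    """Return True if the role may perform the action at this level."""
    # Unknown actions fall back to an empty set, so they are denied.
    return role in POLICIES[level].get(action, set())


if __name__ == "__main__":
    print(is_allowed("analyst", "use", SecurityLevel.INTERNAL))
    print(is_allowed("analyst", "use", SecurityLevel.RESTRICTED))
```

Note that in this sketch the administrator of restricted data is not automatically allowed to use it. Whether that separation makes sense depends on your own regulations and culture.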

There’s no right or wrong way to classify your data. Each company is different. But once you have the data categories, you can then apply your privacy policies.

Applications. Next up are the apps – the user’s primary interface to the data. This can be a desktop app that gives you access to your company’s CRM system, a browser app like Dropbox that lets you store your data in the cloud, or a mobile app like Waze that helps you avoid the traffic jam up ahead. It’s also the BI and analytics apps you use to develop and execute reports and data visualizations.

Whether it’s an enterprise app or a personal app, privacy policies are typically established around who can access what data.

Platforms. The final layer is the platform. This is where the data is stored and processed. This includes the structured data in your operational systems and data warehouses, the Hadoop cluster parked in your company’s data center, and the Twitter data stored in the Amazon cloud.

Even though most user access to the platform can be controlled through an app, you will want to establish privacy policies at the platform level, too – just in case someone – rightfully or wrongfully – wants to bypass the app and access the data directly in the platform.

When you put it all together, it boils down to: Users use apps to create or access data that’s stored in a data platform. It’s important that privacy policies be established at each layer.
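As a rough illustration of policies at each layer, the Python sketch below checks a single request against all four layers and reports which ones would deny it. Every name in it (the segments, categories, app and platform names, and the `denied_layers` function) is made up for the example; it is one possible shape for such a check, not a reference design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    user_segment: str
    app: str            # "direct" means the user bypasses the app layer
    data_category: str
    platform: str


# Hypothetical rules, one table per layer.
KNOWN_SEGMENTS = {"standard report user", "power user", "data scientist"}

# Data layer: which user segments may see each data category.
DATA_RULES = {
    "reporting": {"standard report user", "power user", "data scientist"},
    "analytical": {"power user", "data scientist"},
}

# App layer: which data categories each app is allowed to expose.
APP_RULES = {"bi_tool": {"reporting", "analytical"}}

# Platform layer: which segments may reach the platform via each path.
PLATFORM_RULES = {
    "hadoop_cluster": {
        "bi_tool": {"standard report user", "power user", "data scientist"},
        "direct": {"data scientist"},
    },
}


def denied_layers(req: Request) -> list:
    """Return the layers that deny the request; an empty list means allowed."""
    denied = []
    if req.user_segment not in KNOWN_SEGMENTS:
        denied.append("user")
    if req.user_segment not in DATA_RULES.get(req.data_category, set()):
        denied.append("data")
    # The app check only applies when the request actually goes through an app.
    if req.app != "direct" and req.data_category not in APP_RULES.get(req.app, set()):
        denied.append("app")
    paths = PLATFORM_RULES.get(req.platform, {})
    if req.user_segment not in paths.get(req.app, set()):
        denied.append("platform")
    return denied


if __name__ == "__main__":
    via_app = Request("power user", "bi_tool", "analytical", "hadoop_cluster")
    bypass = Request("power user", "direct", "analytical", "hadoop_cluster")
    print(denied_layers(via_app))
    print(denied_layers(bypass))
```

The second request is the "bypass the app" case: the user and data layers have no objection, so only the platform-level policy stops it.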

Setting the record straight

Now we’re ready to address the five statements, or beliefs about data privacy, mentioned earlier:

Belief #1: “I’ve got nothing to hide.”

Have you said or heard this one: “I’ve got nothing to hide”? Or how about: “If I have done nothing wrong, I have nothing to worry about.”

If this is what you believe, you’re missing the point. Statements like this are just a distraction from the real discussion of online privacy. Consider this: Every day, someone new is coming online. Maybe it’s a young person who just got his first iPhone, or it’s someone in a region that’s just getting affordable access for the first time. They don’t know the rules. You may not care about being openly tracked, but don’t put them into a dangerous situation by letting them believe the internet is safe. Because it’s not.

Hence, a more accurate statement would be: We’ve all got something to hide. It just depends from whom. Be responsible for your own data and help others be responsible for theirs.

Belief #2: Privacy policies apply to data, not users.

Some companies still operate under the belief that if the data is well categorized and policies are placed on it, then who uses it doesn’t matter because you’ve got the privacy in place. This is a wrong assumption because users can and will find a way to get to the data they need.

Hence, a more accurate statement would be: As discussed in the previous section, privacy policies apply to every layer in the data ecosystem – to data, users, apps, and platforms – not just data.

Belief #3: A single privacy policy will suffice for all your data.

A good example of a single privacy policy approach is the privacy policy on a company’s website. You know the one: the policy that we quickly clicked through – and didn’t read – when we initially set up our account on the site. Granted, this consumer-facing policy is very important, and it continues to evolve as we progress down the big data privacy path. But as we’ve been discussing, it’s not the only one your company should have.

Hence, a more accurate statement would be: You need multiple privacy policies for your data categories – not to mention policies for your users, apps, and platforms. But I repeat myself.

Belief #4: Anonymized data keeps my personal identity private.

This is one of my favorites. You’re probably familiar with the concept of de-identifying or anonymizing data. In simple terms, it means removing any information from a data set that could personally identify a specific individual; for example, the person’s name, a credit card number, a social security number, home address, etc. Companies that sell consumer data, such as data brokers, typically only sell anonymized, and often aggregated, data. So what’s the big deal?

The big deal is this: With today’s big data technologies, it’s becoming easier to re-identify individuals from this anonymized data. Programming techniques continue to be developed to pull these anonymized pieces back together from one or more data sets. So if a company says it anonymizes your data before passing it onto others, be aware that your identity could still be revealed through advanced re-identification techniques.
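One well-known way this can happen is a linkage attack: joining an "anonymized" data set to a second data set that still contains names, using fields that both share. The Python sketch below shows the idea on made-up records. The field names, the records themselves, and the choice of ZIP code, birth year, and sex as the linking fields are all assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical "anonymized" records: names removed, other fields kept.
anonymized = [
    {"zip": "02138", "birth_year": 1965, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1980, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1980, "sex": "M", "diagnosis": "migraine"},
]

# Hypothetical second data set that still carries names.
public = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1965, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1980, "sex": "M"},
    {"name": "Carol Example", "zip": "02140", "birth_year": 1972, "sex": "F"},
]

# Fields that are not identifying alone but may be in combination.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link_key(record):
    """Build the join key from the shared quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(anon_records, named_records):
    """Return (name, diagnosis) pairs where the link is unambiguous."""
    names_by_key = defaultdict(list)
    for person in named_records:
        names_by_key[link_key(person)].append(person["name"])
    anon_counts = Counter(link_key(r) for r in anon_records)

    matches = []
    for record in anon_records:
        key = link_key(record)
        names = names_by_key.get(key, [])
        # Only link when the key is unique in both data sets.
        if len(names) == 1 and anon_counts[key] == 1:
            matches.append((names[0], record["diagnosis"]))
    return matches


if __name__ == "__main__":
    print(reidentify(anonymized, public))
```

In this toy data, one person is re-identified because her combination of fields is unique in both sets, while the two records sharing a key stay ambiguous. The more data sets that can be joined, the fewer such ambiguous cases tend to remain.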

Hence, a more accurate statement would be: Individuals may be re-identified from anonymized data. [I will be discussing this topic more in my next post.]

Belief #5: Privacy is dead.

The “privacy is dead” declaration is not new. Here are a few examples from CEOs:

  • About 15 years ago, Scott McNealy of Sun Microsystems said: “You have zero privacy anyway. Get over it.”
  • Then 10 years later, Eric Schmidt of Google said, “If you have something that you don’t want anyone to know, maybe you shouldn’t be doing it in the first place.”
  • In 2010, Mark Zuckerberg of Facebook did an about-face when he argued that privacy is no longer a social norm.

This was 5+ years ago and these are brilliant men, but they’re wrong. Privacy is not dead and we should not get over it.

Hence, a more accurate statement would be: Privacy is not dead. Yet. The focus on big data privacy issues is escalating rapidly – and as more people begin to understand what’s at stake, things will change. In fact, they are already beginning to.

The bottom line

In this big data world in which we live, we are being forced to address privacy issues – whether we want to or not – as consumers, citizens, and employees in the private and public sectors. Let’s get our facts straight about big data privacy and help others do the same.