Sensemaking on Streams – My G2 Skunk Works Project: Privacy by Design (PbD)

Over the last twenty-eight months I have been quietly running a skunk works effort that I’ve code-named “G2.” To my delight, on January 28th, 2011, this system became officially viable, and it will be entering something akin to a “sea trial” phase through 2011.

I believe this system will prove to be my most innovative work to date. I also believe it is the most responsible technology I have ever created.

This new technology, something that might be characterized as a “big data analytic sensemaking” engine, is designed to make sense of new observations as they happen, fast enough to do something about it, while the transaction is still happening. This engine brings to life many of the principles I have been openly sharing on my blog, ranging from Sensemaking Systems Must be Expert Counting Systems, Data Finds Data, Context Accumulation, Sequence Neutrality and Information Colocation to new techniques to harness the Big Data/New Physics phenomenon. That said, as this is version 1.1, there remain many things to do to realize my full vision. It is a very ambitious effort, but more about that some other day.

In terms of responsible innovation, I am even more proud to report that my team and I have baked in, from conception, more privacy and civil liberties enhancing technologies than any other product I am aware of to date.

Friday, January 28th, 2011 – my official launch date – also happened to be international Data Privacy Day. On this day, internationally recognized privacy commissioner Ann Cavoukian hosted a few hundred privacy executives and practitioners from around the world in Toronto, Canada, at her Privacy by Design: Time to Take Control conference. During my keynote, entitled “Confessions of an Architect,” I highlighted seven exciting Privacy by Design (PbD) features that have been baked into this new technology, specifically:

1. Full Attribution

2. Data Tethering

3. Analytics in the Anonymized Data Space

4. Tamper-Resistant Audit Logs

5. False Negative Favoring Methods

6. Self-Correcting False Positives

7. Information Transfer Accounting

The full presentation is here.

Here is a summary of the above seven PbD features:

1. FULL ATTRIBUTION: Every observation (record) needs to know from where it came and when. There cannot be merge/purge data survivorship processing whereby some observations or fields are discarded. Why is this so important?

A. If received data does not contain its data source and transaction pedigree, then system-to-system reconciliation and audit are virtually impossible, especially in large information sharing environments.

B. If the system merges and purges observations, only to discover later that the wrong observations were merged or purged, then without full attribution correcting these earlier mistakes can be difficult if not impossible. The typical alternative is periodic batch reprocessing.

C. The Universal Declaration of Human Rights has four articles containing the word “arbitrary” e.g., Article 9 reads “No one shall be subjected to arbitrary arrest, detention or exile.” If you don’t know where the data came from or when, how can any resulting action be anything but arbitrary?
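To make full attribution concrete, here is a minimal Python sketch. It is an illustration only, not how G2 is implemented: the names Observation, EntityStore, view and unmerge are all hypothetical. The idea shown is that every observation is kept along with its source and arrival time, the entity view is derived from those observations rather than stored by merge/purge, and so an earlier bad merge can be undone without batch reprocessing.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Observation:
    source: str            # which system the record came from
    record_id: str         # the record's identifier in that system
    received_at: datetime  # when the observation arrived
    fields: dict           # the observed values, never discarded


class EntityStore:
    """Hypothetical store that keeps every observation instead of merging."""

    def __init__(self):
        self._observations = {}  # entity_id -> list of Observation

    def add(self, entity_id, obs):
        self._observations.setdefault(entity_id, []).append(obs)

    def view(self, entity_id):
        """Derive the current view; each value carries its pedigree."""
        current = {}
        ordered = sorted(self._observations.get(entity_id, []),
                         key=lambda o: o.received_at)
        for obs in ordered:
            for name, value in obs.fields.items():
                current[name] = (value, obs.source, obs.record_id, obs.received_at)
        return current

    def unmerge(self, entity_id, source, record_id, new_entity_id):
        """Undo a wrong merge by moving one observation to another entity."""
        keep, move = [], []
        for obs in self._observations.get(entity_id, []):
            matched = (obs.source, obs.record_id) == (source, record_id)
            (move if matched else keep).append(obs)
        self._observations[entity_id] = keep
        if move:
            self._observations.setdefault(new_entity_id, []).extend(move)


store = EntityStore()
store.add("E1", Observation("crm", "17", datetime(2011, 1, 1, tzinfo=timezone.utc),
                            {"name": "Pat Lee", "phone": "555-0100"}))
store.add("E1", Observation("billing", "9", datetime(2011, 1, 2, tzinfo=timezone.utc),
                            {"phone": "555-0199"}))
```

Because nothing was purged, calling store.unmerge("E1", "billing", "9", "E2") restores the earlier phone number for E1 and leaves the billing record intact under E2.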

2. DATA TETHERING: Adds, changes and deletes occurring in systems of record must be accounted for in real time, within sub-second latency. Why is this so important?

A. Data currency in information sharing environments is important, especially if one is making important, difficult to reverse decisions that affect people’s freedoms or privileges.

B. When derogatory data is removed or corrected in a system of record, it is vital to reflect such corrections immediately. For example, if someone is removed from a watch list, how long should they have to wait before their name is cleared?
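A small Python sketch of the tethering idea follows. It assumes the system of record emits an event for every add, change and delete, and that a downstream copy applies each event as it arrives. The event shape and the names TetheredCopy, apply and on_list are hypothetical, and a real deployment would also need delivery guarantees that this sketch ignores.

```python
class TetheredCopy:
    """Hypothetical downstream copy that mirrors a system of record."""

    def __init__(self):
        self.records = {}  # (source, record_id) -> fields

    def apply(self, event):
        """Apply one add, change or delete event from the system of record."""
        key = (event["source"], event["record_id"])
        op = event["op"]
        if op in ("add", "change"):
            self.records[key] = dict(event["fields"])
        elif op == "delete":
            # The correction takes effect as soon as the event is applied.
            self.records.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")

    def on_list(self, name):
        return any(r.get("name") == name for r in self.records.values())


copy = TetheredCopy()
copy.apply({"op": "add", "source": "watchlist", "record_id": "1",
            "fields": {"name": "J. Doe"}})
listed_before = copy.on_list("J. Doe")
copy.apply({"op": "delete", "source": "watchlist", "record_id": "1"})
listed_after = copy.on_list("J. Doe")
```

The delete event only works because the copy kept the source and record identifier for each record, which is the full attribution described in feature 1.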

3. ANALYTICS ON ANONYMIZED DATA: The ability to perform advanced analytics (including some fuzzy matching) over cryptographically altered data means organizations can anonymize more data before information sharing. Why is this so important?

A. With every copy of data, there is an increased risk of unintended disclosure.

B. Data anonymized before transfer and anonymized at rest reduces the risk of unintended disclosure.

C. If organizations can now share information in an anonymized form and still get a materially similar result, why would organizations want to share information any other way?

[Technical Note: As every anonymized value maintains full attribution, re-identification is by design to support Data Tethering as well as reconciliation and audit.]
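One simple way to picture matching over cryptographically altered values is a keyed one-way hash applied after normalization. The Python sketch below is an assumption on my part for illustration, not a description of the technique G2 uses. SHARED_KEY, normalize, anonymize, anonymize_record and shared_fields are hypothetical names, and the normalization step only tolerates differences in case, spacing and punctuation, which is a very limited form of fuzzy matching.

```python
import hashlib
import hmac
import re

# Hypothetical key that the sharing parties agree on out of band.
SHARED_KEY = b"demo-key-agreed-out-of-band"


def normalize(value):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def anonymize(value, key=SHARED_KEY):
    """Keyed one-way hash of the normalized value (HMAC-SHA256)."""
    return hmac.new(key, normalize(value).encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(source, record_id, fields):
    """Anonymize the values but keep attribution for later reconciliation."""
    return {
        "source": source,
        "record_id": record_id,
        "tokens": {name: anonymize(value) for name, value in fields.items()},
    }


def shared_fields(a, b):
    """Names of fields whose anonymized values are equal in both records."""
    return sorted(n for n in a["tokens"] if a["tokens"][n] == b["tokens"].get(n))


left = anonymize_record("bank", "42", {"name": "Mark R. Smith", "phone": "555-1212"})
right = anonymize_record("insurer", "7", {"name": "mark r smith", "phone": "(555) 1212"})
matches = shared_fields(left, right)
```

Neither party sees the other's clear-text values, yet both can tell that the name and phone agree. The source and record_id carried alongside the tokens are what let the owning organization re-identify its own record when needed.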

4. TAMPER-RESISTANT AUDIT LOGS: Each record of who searches for what should be logged in a tamper-resistant manner – even the database administrator should not be able to alter the evidence contained in this audit log. Why is this so important?

A. Every now and then people with access and privilege take a look at records without a legitimate business purpose, e.g., an employee at a financial services institution taking a peek into a roommate’s file.

B. Tamper-resistant logs make it possible to audit user behavior.

C. And, when the word gets out to the work force that such accountability exists, this can cause a chilling effect on misuse.
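A common way to make a log tamper-evident is a hash chain, in which each entry includes the hash of the entry before it. The Python sketch below shows that general technique as an illustration; I am not claiming it is the mechanism in G2, and AuditLog, append and verify are hypothetical names.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    """SHA-256 over a canonical JSON rendering of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditLog:
    """Hypothetical append-only log in which each entry seals the one before."""

    def __init__(self):
        self.entries = []

    def append(self, user, query, timestamp):
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {"user": user, "query": query, "timestamp": timestamp, "prev": prev}
        self.entries.append({**body, "hash": _digest(body)})

    def verify(self):
        """Return True only if no entry was altered, removed or reordered."""
        prev = GENESIS
        for entry in self.entries:
            body = {k: entry[k] for k in ("user", "query", "timestamp", "prev")}
            if entry["prev"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True


log = AuditLog()
log.append("alice", "name=J. Doe", "2011-01-28T09:00:00Z")
log.append("bob", "account=1234", "2011-01-28T09:05:00Z")
intact = log.verify()
```

A chain like this detects edits but does not prevent them. Someone able to rewrite the entire log could recompute every hash, so the most recent hash would also need to be anchored somewhere the database administrator cannot reach.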

5. FALSE NEGATIVE FAVORING METHODS: The ability to more strongly favor false negatives is of critical importance in systems that could be used to adversely affect someone’s civil liberties. Why is this so important?

A. In many business scenarios, it is better to miss a few things (false negatives) than inadvertently make claims that are not true (false positives). False positives can adversely affect people’s lives – e.g., the police find themselves knocking down the wrong door or an innocent passenger is denied the ability to board a plane.

[Technical Note: Sometimes a new observation can lead to multiple conclusions. Systems that are not false negative favoring may select the strongest conclusion and ignore the remaining conclusions. But had the strongest candidate not existed, the second strongest conclusion would be asserted. One false negative favoring method involves remedying such a condition, for example by reversing an earlier conclusion should a future observation bring to light the fact that multiple possible conclusions now exist.]
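The contrast in the technical note can be sketched in a few lines of Python. This is a toy under stated assumptions: candidates are given as a mapping from entity to match score, and the 0.8 threshold and the function names pick_strongest and resolve_favoring_false_negatives are hypothetical.

```python
def pick_strongest(candidates):
    """Not false negative favoring: always assert the top-scoring candidate."""
    return max(candidates, key=candidates.get) if candidates else None


def resolve_favoring_false_negatives(candidates, threshold=0.8):
    """Assert a match only when exactly one candidate is plausible.

    candidates maps entity id -> match score in [0, 1]. When two or more
    candidates clear the threshold, the observation is ambiguous, so no
    assertion is made (a possible false negative) rather than risking a
    false positive.
    """
    plausible = [eid for eid, score in candidates.items() if score >= threshold]
    return plausible[0] if len(plausible) == 1 else None


ambiguous = {"entity-A": 0.95, "entity-B": 0.90}
clear = {"entity-A": 0.95, "entity-B": 0.40}

strongest = pick_strongest(ambiguous)
cautious = resolve_favoring_false_negatives(ambiguous)
confident = resolve_favoring_false_negatives(clear)
```

With the ambiguous scores, pick_strongest asserts entity-A even though entity-B would have been asserted had entity-A not existed. The cautious version declines to assert either one, and it still asserts entity-A when entity-A is the only plausible candidate.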

6. SELF-CORRECTING FALSE POSITIVES: With every new observation presented, prior assertions are re-evaluated to ensure they are still correct, and if no longer correct, these earlier assertions can often be repaired – in real-time, not end of month. Why is this so important?

A. False positives occur when an assertion (claim) is made, but is not true. False positives can adversely affect people’s lives, e.g., consider someone who cannot board a plane because he or she shares a similar name and date of birth with someone else on a watch list.

B. Without self-correcting false positives, databases start to drift from the truth and become provably wrong (even to the naked eye) – necessitating periodic (batch) reloading to true-up the database.

C. Periodic monthly reloading to correct for false positives means wrong decisions are possible all month until the next reload, even though the system had everything it needed to know beforehand.

[Technical Note: Reversing earlier assertions in real-time at scale, as new observations present themselves, is computationally non-trivial. Imagine making an assertion that two people are the same because they share exactly the same name, address and home phone number – only later to learn through another series of observations that these are really two different people (a junior and a senior). Our “self-correcting false positives” feature self-corrects for these rare cases, in real-time. We consider our ability to perform sequence neutrality at scale one of several breakthrough aspects of our work.]
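The junior/senior case can be shown with a toy resolver in Python. It is a deliberately small illustration and not the G2 algorithm: it assumes observations are dictionaries with id, name, address, phone and an optional dob (date of birth), and the Resolver class and its matching rule are hypothetical. Each new observation triggers a re-evaluation of every earlier assertion that shares its match key.

```python
from collections import defaultdict


class Resolver:
    """Hypothetical resolver that re-checks prior assertions on every input."""

    def __init__(self):
        self._groups = defaultdict(list)  # match key -> observations seen so far

    def observe(self, obs):
        """Add an observation and return the re-evaluated entities for its key."""
        key = (obs["name"], obs["address"], obs["phone"])
        self._groups[key].append(obs)
        return self._entities_for(key)

    def _entities_for(self, key):
        group = self._groups[key]
        dobs = sorted({o["dob"] for o in group if o.get("dob")})
        if len(dobs) <= 1:
            # No conflicting evidence: assert that all records are one person.
            return [sorted(o["id"] for o in group)]
        # Conflicting dates of birth: split by date of birth, and leave records
        # without one unattached instead of guessing which person they are.
        entities = [sorted(o["id"] for o in group if o.get("dob") == d) for d in dobs]
        entities += [[o["id"]] for o in group if not o.get("dob")]
        return sorted(entities)


base = {"name": "Bob Smith", "address": "1 Main St", "phone": "555-0100"}
r1 = {**base, "id": "r1"}
r2 = {**base, "id": "r2", "dob": "1950-03-01"}
r3 = {**base, "id": "r3", "dob": "1980-07-15"}

resolver = Resolver()
resolver.observe(r1)
before = resolver.observe(r2)  # one person so far
after = resolver.observe(r3)   # a second date of birth reverses that assertion
```

Because the entities are recomputed from the full set of observations for the key, the final answer is the same whatever order r1, r2 and r3 arrive in, which is a small-scale picture of sequence neutrality. Doing this over billions of records in real time is the hard part that this sketch does not address.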

7. INFORMATION TRANSFER ACCOUNTING: Every secondary transfer of data, whether to human eyeball or tertiary system, can be recorded to allow stakeholders (e.g., data custodians or the consumers themselves) to determine how their data is flowing. Why is this so important?

A. It is often cumbersome to learn who has seen what records, or what records have been shared with tertiary systems.

B. Much like a US credit report that contains an inquiries section exposing the list of recent inquiring parties, now so can your medical or financial file.

C. Users can now be easily provided with such disclosures, increasing transparency and control, e.g., enabling a consumer in some cases to request an information recall.

D. When there is a series of leaks, information transfer accounting makes discovery of who accessed all records in the series quite trivial. This can narrow an investigation when looking for criminals within.
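A minimal ledger makes the last two points easy to see. The Python sketch below is illustrative only, with hypothetical names (TransferLedger, record_transfer, inquiries, accessed_all). It records each transfer against the record involved, lists the inquiries for one record in the manner of a credit report, and intersects the recipients across a series of leaked records.

```python
from collections import defaultdict


class TransferLedger:
    """Hypothetical ledger of every transfer of a record to a person or system."""

    def __init__(self):
        self._by_record = defaultdict(list)  # record_id -> [(recipient, timestamp)]

    def record_transfer(self, record_id, recipient, timestamp):
        self._by_record[record_id].append((recipient, timestamp))

    def inquiries(self, record_id):
        """Who has received this record, and when (like a credit report)."""
        return list(self._by_record.get(record_id, []))

    def accessed_all(self, record_ids):
        """Recipients that received every record in a series of leaks."""
        recipient_sets = [
            {recipient for recipient, _ in self._by_record.get(rid, [])}
            for rid in record_ids
        ]
        return set.intersection(*recipient_sets) if recipient_sets else set()


ledger = TransferLedger()
ledger.record_transfer("rec-1", "analyst-a", "2011-01-10T10:00:00Z")
ledger.record_transfer("rec-1", "analyst-b", "2011-01-11T11:00:00Z")
ledger.record_transfer("rec-2", "analyst-b", "2011-01-12T12:00:00Z")
ledger.record_transfer("rec-3", "analyst-b", "2011-01-13T13:00:00Z")
ledger.record_transfer("rec-3", "partner-system", "2011-01-13T14:00:00Z")

suspects = ledger.accessed_all(["rec-1", "rec-2", "rec-3"])
```

Here only analyst-b received all three leaked records, so the investigation can start with one party instead of three.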

What has me most excited is that, where some of the features above would typically be extra-priced options, in my new system so many are built in (e.g., the tamper-resistant audit logs). And some of our privacy and civil liberties enhancing features cannot even be turned off!

Yes, there is an official name for my new technology. And no, I’m not telling you, because this is not a sales pitch. Rather, I am simply trying to inspire other technologists to consider Privacy by Design as they innovate.

I’ve had two great days at IBM. The first great day was in January 2005 when IBM bought my company, SRD. And the second great day came six years later on January 28th, 2011.

RELATED MATERIAL:

Privacy by Design (PbD)