Once Upon a Time in the Data

“Once upon a time, and a very good time it was, there was a moocow coming down along the road, and this moocow that was coming down along the road, met a nicens little boy named baby tuckoo . . .”

That is the opening line from the novel A Portrait of the Artist as a Young Man by James Joyce.


This novel’s unstructured data can be quite challenging, especially the opening chapter since it is written from the perspective of a young child discovering both the world and the words used to describe it.

Harry Levin, editor of a collection of Joyce’s work, commented that “the novelist and the poet, through their command of words, are mediators between the world of ideas and the world of reality.”

All data professionals, through their command of data, are also mediators between the world of ideas—whether recorded in the structured data of relational databases and spreadsheets, or the unstructured data of documents and social media content—and the world of reality, which is what all of that structured and unstructured data are discovering and attempting to describe.


Data is not Literal

As I have written about in previous posts, whether it’s an abstract description of real-world entities (i.e., “master data”) or an abstract description of real-world interactions (i.e., “transaction data”) among entities, data is an abstract description of reality.


The inconvenient truth is that the real world is not the same thing as these abstract descriptions of it—not even when we believe that data perfection is possible (or have managed to convince ourselves that our data is perfect).

Although real-world alignment is a good definition for data quality, there is always a digital distance between data and reality.  Data is not literal, which means that it can never literally represent reality—data can only describe reality.


Data is Literary

There is a structure in the unstructured data of novels and poetry, but it’s eerily reminiscent of the structure we impose on reality by describing it with data.  A novel is a narrative creating a static and comforting—but fictional—semblance of reality.

To make sense of a novel or a poem—or of any data—we must enter its reality, we must believe that its fiction is fact.


Samuel Taylor Coleridge explained the necessity of believing in this “semblance of truth sufficient to procure for these shadows of imagination that willing suspension of disbelief for the moment, which constitutes poetic faith.”

“The final belief,” Wallace Stevens once wrote, “is to believe in a fiction, which you know to be a fiction.”  Stevens believed that reality is created by our imagination, which we use to understand the constantly changing world around us.  “Reality is the product of the most august imagination.”

Data is a fiction we believe in, which we know to be a fiction.  Data is not literal, but literary—data tells us a story.


Data is a Storyteller

Our data tells us stories about the real world.  Data quality is concerned with how well these stories describe who was involved and what happened.  Master data are the story’s characters and subjects, while transaction data are the events and interactions comprising the narrative of the story.  Let’s use a simple (and fictional) example:


Michelle Davis-Donovan purchases a life insurance policy for her husband Michael Donovan from Vitality Insurance.

The characters are Michelle Davis-Donovan, Michael Donovan, and Vitality Insurance.  The event bringing them together is the purchase of what becomes the subject of the story that connects them, a life insurance policy, around which a narrative forms.

One of the recurring interactions in the narrative is the premium payment that Michelle sends to Vitality.  Another event, occurring later in the story, is Michael’s unexpected death, which triggers both the end of the premium payments and the beginning of the processing of the insurance claim, eventually resulting in a payment made to Michelle by Vitality.

In data management terms, Michelle Davis-Donovan and Michael Donovan (Customers), the life insurance policy (Product), and Vitality Insurance (Vendor) are all master data, and the life insurance premium and claim payments are transaction data.
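For readers who think in code, here is a minimal sketch of that split, written in Python.  Everything in it is hypothetical: the class names, the record identifiers, the amounts, and the dates are invented for illustration, and no real insurance system is being described.  Master data holds the characters and subjects, transaction data holds the events, and the story is what you get when you read the events in order.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MasterRecord:
    """A character or subject of the story (master data)."""
    record_id: str
    role: str  # "Customer", "Product", or "Vendor"
    name: str


@dataclass(frozen=True)
class Transaction:
    """An event or interaction in the narrative (transaction data)."""
    kind: str  # "premium" or "claim"
    payer_id: str
    payee_id: str
    policy_id: str
    amount: float
    on: date


# Hypothetical identifiers; the names come from the fictional example.
master = {
    "C1": MasterRecord("C1", "Customer", "Michelle Davis-Donovan"),
    "C2": MasterRecord("C2", "Customer", "Michael Donovan"),
    "P1": MasterRecord("P1", "Product", "Life insurance policy"),
    "V1": MasterRecord("V1", "Vendor", "Vitality Insurance"),
}

# Invented amounts and dates, deliberately stored out of order.
transactions = [
    Transaction("claim", "V1", "C1", "P1", 50000.0, date(2005, 6, 1)),
    Transaction("premium", "C1", "V1", "P1", 100.0, date(2001, 1, 15)),
]


def tell_story(master, transactions):
    """Narrate the transactions in date order, naming the characters."""
    lines = []
    for t in sorted(transactions, key=lambda t: t.on):
        payer = master[t.payer_id].name
        payee = master[t.payee_id].name
        lines.append(f"{t.on}: {payer} pays {payee} a {t.kind} of {t.amount:.2f}")
    return lines


story = tell_story(master, transactions)
for line in story:
    print(line)
```

Notice that the narrative only makes sense because each transaction points back to a master record.  If the characters are wrong, every event that references them tells the wrong story.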


It may be tempting to think of the similar stories in our databases as non-fiction, as a historical account describing real-world people and events.  After all, it’s probably safe to assume that Vitality Insurance had verified that Michelle had, in fact, paid the premiums on the life insurance policy, as well as verified that Michael was, in fact, dead, before cutting a check for the claim.

But even history is a convenient fiction, which is open to revision based on the presentation of newly discovered “facts.”

Let’s imagine that Michelle starts a new chapter in her life’s story by changing her given name to Clarissa and then marrying Richard Dalloway.  Mrs. Dalloway then purchases a life insurance policy for her husband from Vitality Insurance.

After a few years of bank-verified premium payments made by Clarissa to Vitality, Richard unexpectedly dies.


How is this reality described by the data managed by Vitality Insurance?  Is Clarissa Dalloway the same real-world person as Michelle Davis-Donovan?  Is Michelle, if that’s even her real name, killing her husbands to collect on their life insurance policies?

No doubt there are characters, subjects, events, and interactions like these to be found in the stories your data is telling you.

Is your data fact or fiction?  More specifically, is your data a fiction that you feel you have to believe in?


Once Upon a Time in the Data

Stephen Dedalus, the protagonist of A Portrait of the Artist as a Young Man, was James Joyce’s literary alter ego, some aspects of which accurately described him and his actual real-life experiences.  Does this make author and character literally equivalent?


Would data matching routines identify Stephen Dedalus and James Joyce as duplicate customers?
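To make the question concrete, here is a deliberately naive sketch of a name-based matching routine in Python.  It is an assumption of mine, not a description of how any particular data matching tool works: it compares two names with the standard library's difflib.SequenceMatcher and flags them as probable duplicates above a threshold of 0.85 that I picked arbitrarily.  Real matching routines weigh far more evidence than a name.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_probable_duplicate(a, b, threshold=0.85):
    """Flag two names as probable duplicates (hypothetical threshold)."""
    return name_similarity(a, b) >= threshold


pairs = [
    ("Stephen Dedalus", "James Joyce"),
    ("Michelle Davis-Donovan", "Clarissa Dalloway"),
    ("Michelle Davis-Donovan", "Michelle Davis Donovan"),
]

for left, right in pairs:
    verdict = "match" if is_probable_duplicate(left, right) else "no match"
    print(f"{left} / {right}: {verdict}")
```

A routine like this catches a dropped hyphen, but it sees no connection between Stephen Dedalus and James Joyce, or between Michelle Davis-Donovan and Clarissa Dalloway.  The name is only one part of the story the data tells.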

What about your data?  I do not mean the data you work with as a data professional; I mean your personal data.  How many companies view you as a customer?  How many companies have master and transaction data that is telling stories about you?

All of that data is your literary alter ego.  Is that data fact or fiction?  Are all of those stories about you true?

I am pretty sure that the companies believe so, but does every aspect of that data accurately describe you?  Do these stories tell the truth about your current postal addresses, e-mail addresses, and telephone numbers?  Do these stories tell the truth about your age, the number of times you have been married, or how many children you currently have?


I often wonder about my personal data that is roaming countless databases in countless companies, telling stories about how:

“Once upon a time in the data, and a very good time it was, there was some customer data entered, and this customer data that was entered, told the story of a nicens real-world person named Jimmy . . .”

The Future of Our Data’s Story

Data privacy and protection are increasingly prevalent topics of discussion, especially in relation to data moving into the cloud.  Earlier this year, I wrote a blog post that examined some of the impacts of the semantic web on the future of data management.

Lately I’ve been thinking about how these two trends could give customers greater control over their literary alter egos, that is, over their personal data and the stories that it could tell.


Perhaps when this finally happens, our data’s story will become more fact than fiction.