SmartData Collective

It’s data, Jim, but not as we know it – Part 1: What the echo of the Big Bang tells us about the nature of information

TeradataEMEA

Possibly I am just turning into a grumpy old man in my middle-age, but there are two words that when used together annoy me beyond almost all reason – yes, even more than the “p-word” that has featured in two of my previous posts: “unstructured” and “data.”

Despite what some vendors – and some commentators, who really should know better – would have you believe, there is nothing remotely formless or “unstructured” about “new” types of data, like image files, audio files, text-based documents, XML documents and so on. Of course for the most part these data hardly qualify as “new,” either, but don’t indulge my pedantry by getting me started down that road.

Data is merely information that has been encoded in some way and the only truly “unstructured data” is “noise”; random signals, representative of nothing much more than a system in equilibrium with its environment. A picture, a song, the complete works of Shakespeare – these are all forms of information and they are emphatically not “unstructured.”
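One rough way to see the difference between information and noise is to ask a general-purpose compressor how much regularity it can find. The sketch below is only an illustration, not a formal measure of structure: it uses Python's standard `zlib` module, the function name `compression_ratio` is made up for this example, and the sample text is a single quotation repeated many times, which exaggerates how compressible ordinary prose is.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower means more regularity found)."""
    return len(zlib.compress(data, 9)) / len(data)


# English text: one short quotation repeated so the compressor has material to work with.
# (Assumption for illustration: real prose compresses far less dramatically than this.)
text = ("To be, or not to be, that is the question. " * 250).encode("utf-8")

# "Noise": bytes from the operating system's random source, same length as the text.
noise = os.urandom(len(text))

text_ratio = compression_ratio(text)
noise_ratio = compression_ratio(noise)

print(f"text:  {text_ratio:.3f}")
print(f"noise: {noise_ratio:.3f}")
```

The text shrinks to a small fraction of its size, while the random bytes barely shrink at all, because there is no pattern in them for the compressor to exploit.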

To see the truth of this, take, for example, a GIF file (make sure that it is one that you don’t much care about, or a copy of one that you do) and open it with a text editor. Now mess with and/or delete some of the bytes at random, save the adulterated file and then try to open it with your normal picture editing or viewing software.

In fact a GIF file is highly structured. Its header carries meta-data that includes, for example, a colour table; the width and height, in pixels, of the image encoded in the data that follows; whether the image is animated or still; etc., etc. All this meta-data is then followed by the bytes that encode the actual bitmap and an end-of-file marker. Monkey with this file structure and you risk reducing the value of the data that it contains to peanuts; monkey with the actual data payload and you likewise either corrupt the file so that it can’t be read or change it so that it represents a different or a degraded image. Repeat this experiment with just about any multimedia file type and you will get the same result – either a corrupt file that cannot be read correctly or one that is no longer an accurate representation of the original object. These data are not only structured; the nature of that structure is critical to their correct interpretation.
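For readers who would rather not damage a real picture, the structure at the front of a GIF can be inspected directly. The following is a minimal sketch, assuming the standard GIF layout of a six-byte signature followed by a seven-byte logical screen descriptor; the function name `read_gif_header` and the embedded one-pixel sample image are introduced here purely for illustration, and the sketch makes no attempt to decode the image itself.

```python
import struct

# A tiny 1x1 GIF written out byte by byte, so the example needs no file on disk.
TINY_GIF = bytes.fromhex(
    "474946383961"          # signature and version: "GIF89a"
    "01000100"              # width = 1, height = 1 (little-endian 16-bit values)
    "800000"                # packed flags, background colour index, pixel aspect ratio
    "000000ffffff"          # global colour table with 2 entries
    "21f9040100000000"      # graphic control extension
    "2c000000000100010000"  # image descriptor
    "0202440100"            # compressed pixel data
    "3b"                    # trailer (end-of-file marker)
)


def read_gif_header(data: bytes) -> dict:
    """Parse the signature and logical screen descriptor of a GIF byte string."""
    if len(data) < 13:
        raise ValueError("too short to contain a GIF header")
    signature = data[:6]
    if signature not in (b"GIF87a", b"GIF89a"):
        raise ValueError(f"not a GIF file: bad signature {signature!r}")
    width, height, packed, _background, _aspect = struct.unpack("<HHBBB", data[6:13])
    has_table = bool(packed & 0x80)  # top bit: is a global colour table present?
    entries = 2 ** ((packed & 0x07) + 1) if has_table else 0
    return {
        "version": signature[3:].decode("ascii"),
        "width": width,
        "height": height,
        "colour_table_entries": entries,
        "has_trailer": data.endswith(b"\x3b"),
    }


header = read_gif_header(TINY_GIF)
print(header)

# Overwrite a single byte of the signature and the file is no longer recognisable.
corrupted = b"X" + TINY_GIF[1:]
try:
    read_gif_header(corrupted)
except ValueError as error:
    print("corrupted copy rejected:", error)
```

Changing one byte in the first six is enough for this parser to refuse the file, which is the same experiment as the text editor one, performed on a copy held in memory.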

And of course it’s not just the “wrapper” that has structure; the structure of the data itself is critical. Most people would interpret the statement “Dave didn’t marry Sue because she was rich” as meaning that Dave and Sue were married, but that Dave’s motivation for their union was not financial. Conversely, the statement that “Dave didn’t marry Sue, because she was rich” would probably be interpreted as meaning that Dave and Sue did not marry and that it was the difference in their circumstances that got in the way. A single structural element – one comma – makes a big difference to our interpretation of the “same” data. Suppose that during their courtship Dave tells Sue “I love you”; the structure of this sentence is identical to the structure of the sentence “I want you” (subject-verb-object, I think, but if I am mistaken and there are any linguists out there reading this, please feel free to correct me), but the two statements may or may not be synonymous (although I hear that Dave is a good guy, so perhaps we should give him the benefit of the doubt).

In fact, even apparently random noise can convey meaning. Tune a radio telescope to the microwave range of the electromagnetic spectrum and you will hear a faint hum, directionally uniform to 1 part in 500. This is quite literally a distant reverberation of the “Big Bang” in which the Universe was created and which confirms that the Universe was indeed once hot-and-dense, as the Big Bang theory demands that it must have been. That’s important information, as historically there have been other theories of the origin of the Universe that don’t assume an explosive beginning.

From measurements of the cosmic microwave background radiation, as it is called, physicists and astronomers are able either to infer or to calculate directly many other essential truths about the Universe, including the speed at which our galaxy is moving (600 kilometres-per-second towards the constellation of Leo, in case this answer is one day all that stands between you and the “Who Wants to Be a Millionaire?” prize money). It turns out that there is an awful lot of important information encoded in that apparently random noise.

Back on Earth, less exotic, “new” types of data are increasingly interesting to the commercial and government organizations that most of us serve. We should probably call these “multimedia data”, “non-record based data” or “non-relational” data. Actually, I’m not crazy about “non-relational” either; whilst this data is typically not relational in the accepted sense – the ordering of the bytes that define the bitmap in a GIF file is important, for example – this data can, after all, be accommodated in tables in a relational database using BLOB and CLOB objects. So long as we regard these objects themselves as atomic, it seems to me these data are as relational as any other attribute of an entity. Things clearly get more complex if we want to examine or “query” the objects themselves (“select all of the pictures in which the sky is red”), but let’s not go there for now.
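The point about treating such objects as atomic can be sketched with Python's built-in `sqlite3` module. This is an illustrative assumption rather than anything specific to a particular database product: the table `pictures`, its columns and the stand-in image bytes are all hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

# In-memory database, so the example leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pictures ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " image BLOB NOT NULL)"  # the whole image is stored as one opaque value
)

# Stand-in bytes for an image file (hypothetical payload, not a valid picture).
payload = b"GIF89a" + bytes(range(16))
conn.execute("INSERT INTO pictures (title, image) VALUES (?, ?)", ("sunset", payload))

# Ordinary relational access: select by an attribute of the entity, and the
# BLOB comes back as a single value with its byte order intact.
title, size, stored = conn.execute(
    "SELECT title, length(image), image FROM pictures WHERE title = ?",
    ("sunset",),
).fetchone()

print(title, size, stored == payload)
conn.close()
```

The database can find the picture by its title and hand back the bytes unchanged, but it has no way of answering a question about what the picture shows, which is exactly the harder problem set aside above.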

My recent travelling companion and the main attraction on the “CTO Road Show” that we took on tour across the EMEA region in June – Teradata CTO Stephen Brobst – refers to “non-traditional data types” versus “record-based” or “square” data. These are definitions that I can live with. And I’m sure that engineering PhD Stephen will sleep easier for knowing that the flunky from marketing considers his use of technical vocabulary to be correct and not in the least aggravating!

 

Martin Willcox
