An Introduction to Unstructured Data

Organizations everywhere are beginning to feel mounting pressure to collect and store any form of data they can get their hands on. From nonprofits collecting data on the populations they serve to social media giants looking for ways to monetize, data of any kind is the raw good that most organizations this century will look to spin into piles of gold.

The challenge right now, however, is finding meaning and use in unstructured data, much of which we don’t yet have the tools to analyze. So what exactly is the difference between structured and unstructured data? The two labels imply a neater distinction than really exists in either type. Structured data is data that has been entered into a system of rows and columns that indicate a type or quantity. The most familiar example of structured data is a spreadsheet. Each row represents some thing that you’re keeping track of, and each column records an attribute of the item in that row. These tables can be gargantuan in size or as simple as the primitive, paper-based double-entry ledger described by Luca Pacioli.

A kitchen of structured data would have all its forks, knives, and spoons neatly stashed in their compartments, the knives sharpened and smartly stored in their block on the countertop, the pans stacked in order of size and stowed snugly under the range. The world of unstructured data looks a lot more like the clutter-filled junk drawer, only the junk drawer has overflowed into the rest of the house. Unstructured data is a much broader category that exists in many forms. It includes all of those emails stored in your account, the breadth of digitized music and movies floating around the Internet, the information silently generated by machines as they communicate with one another, and much more.

Unstructured data can look like anything from the binary encoding of specific words to pages of IP addresses with no other information attached. The sheer breadth of information that falls under the unstructured category is what drives the dichotomy of structured vs. unstructured data.

Think about your last doctor’s visit. If you’ve been visiting the same doctor for some time, he or she probably has a wealth of information about you: blood test results, your height and weight at each age, your blood pressure range, and lots of other little notes about your comments or complaints and the nature of your visit. Though these charts and records are all written into forms and stored in either a paper file or a computer file, there is a lot of information in those forms that today’s number-centric analysis tools simply can’t tackle. If your records have been entered into an electronic database, traditional analytics tools could parse your weight and the other numbers listed, but the wealth of doctor’s notes (the unstructured data) would be left behind.

Another useful way to understand the grey areas between structured and unstructured data is through the concept of business intelligence. Over the course of decades, businesses have adopted more and more technology to collect and store transaction data. Businesses carry plenty of structured information about a range of things: product flow, customer demographics, changes in the prices of raw goods, income, and expenses. These numbers are well-behaved. As businesses came to rely ever more heavily on machine-based transactions, including internet purchases, they started storing more and more extraneous, less numerical information about those transactions, simply because it existed. All of the data, structured or not, cleaned up or not, has been siloed into large collections of information that businesses were afraid not to collect, simply because it could one day prove valuable.
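The doctor’s-visit example earlier in this section can be made concrete with a small sketch. Everything in it is hypothetical: the `visits` records, the field names, and the `summarize_numeric` helper are invented for illustration and do not come from any real records system. It only shows how a number-centric summary handles the numeric fields and passes over the free-text notes.

```python
from statistics import mean

# Hypothetical patient visits: the numeric fields are structured,
# while the free-text "notes" field is unstructured.
visits = [
    {"weight_kg": 70.5, "systolic": 118, "notes": "Patient reports mild headaches."},
    {"weight_kg": 71.0, "systolic": 122, "notes": "Headaches resolved; sleeping poorly."},
    {"weight_kg": 70.0, "systolic": 120, "notes": "No complaints."},
]


def summarize_numeric(records):
    """Average every field that is numeric in all records; skip the rest."""
    summary = {}
    skipped = set()
    for key in records[0]:
        values = [record[key] for record in records]
        if all(isinstance(value, (int, float)) for value in values):
            summary[key] = mean(values)
        else:
            # Free text cannot be averaged, so it is left behind.
            skipped.add(key)
    return summary, skipped


summary, skipped = summarize_numeric(visits)
print("Numeric summary:", summary)
print("Fields left behind:", skipped)
```

The weight and blood pressure averages come out cleanly, while the `notes` field, which arguably says the most about the patient, ends up in the skipped set.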

These are not the only instances of unstructured data being stored in some sort of database to be labored over at a later date. The United States Census Bureau might have been a pioneer in making sense of massive amounts of data, but Census workers have also systematically collected many forms of information that don’t fit neatly into the rows and columns of a spreadsheet. While the ultimate dividing line between structured and unstructured data might appear to be structure itself, that’s only half of the truth: many kinds of data that we are currently grappling with have structures of their own. Words fit into linguistic patterns, photos have a composition, IP addresses have a logical order, and machines communicate in scripts that follow an order.
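As one illustration of the point that so-called unstructured data still carries structure, the sketch below takes a page of raw lines and recovers the logical order of the IP addresses in it using Python’s standard `ipaddress` module. The sample lines and the `parse_addresses` helper are assumptions made up for this example, and it only handles IPv4 values.

```python
import ipaddress

# Hypothetical raw lines, as they might appear in a log with no other context.
raw_lines = ["192.168.1.10", "10.0.0.5", "not an address", "192.168.1.2", "8.8.8.8"]


def parse_addresses(lines):
    """Split lines into valid IP addresses (sorted) and rejected text."""
    parsed = []
    rejected = []
    for line in lines:
        try:
            parsed.append(ipaddress.ip_address(line.strip()))
        except ValueError:
            # Anything that does not follow the address pattern is set aside.
            rejected.append(line)
    # Address objects compare numerically, so sorting exposes their order.
    # (This sample is IPv4 only; mixing IPv4 and IPv6 would not sort.)
    return sorted(parsed), rejected


addresses, rejected = parse_addresses(raw_lines)
private = [str(address) for address in addresses if address.is_private]

print("Sorted:", [str(address) for address in addresses])
print("Private ranges:", private)
print("Rejected:", rejected)
```

Sorted as plain strings, "192.168.1.10" would come before "192.168.1.2"; parsing the addresses first is what lets the numeric order inside them show through.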

This is an excerpt from the ebook “Structured, Unstructured & Everything In Between”.