Learning to Predict Death with Big Data

Death. Many of the discussions around data revolve around death. These discussions are typically pretty mundane. Why do phone calls die? Why do banks go out of business? What makes servers and other network machinery fail? It’s sad, but sometimes servers die, phone calls are lost, and some long-standing banks suddenly come to an end.

When that happens we use data to discover where it all went wrong. More accurately, we try to use data to discover where it all went wrong, poring through customer accounts data, cell tower network logs, profit and loss statements. That’s the post-mortem, a Big Data autopsy.

Typically these are done using logs or selected data sets, where an analyst pulls some segment of the data for analysis. If that data has structure, there are many tools for querying and analyzing it. For unstructured data, the long strings of numbers and letters that are the language of machines, solutions have begun to emerge that index the data, making it possible for analysts to search for the signals that will lead to the right answer. In each case, analysts and data experts must comb through data for a cause.

Not too dissimilar is the actual autopsy, which is itself a search for cause. Coming to a definitive cause of death can require a knowledgeable physician, but regardless of the skills of the investigator, prior knowledge of the subject can be critical. Some autopsies are straightforward, but we could easily imagine a situation where there are many contributing factors, some combination of toxins and disease that leads to an untimely death. Decades of experience allow a seasoned professional to compare observations with historical data, but without a perfect match, diagnosis becomes exploratory. That’s a near-perfect analogy to the state of data analytics today. We’ve moved beyond asking simple questions that have simple answers. The challenge of analytics is in finding the combination of events that has led to death.

The big data specialist typically comes in one of two forms. The first is a practicing data expert, from junior analyst to seasoned data scientist. The second is a subject matter expert, a business user or technology expert with deep practical and applicable knowledge of the business. On the rare occasion that an analyst has both the technical skills and the deep knowledge of the data, their employers will bend over backwards to retain them.

Data science has become a “sexy” job in part because of how rare it is to find someone with the right combination of skills and know-how. These individuals can intuit the combinations of factors that might lead to a specific event better than a pure data person and manipulate data more skillfully than a typical business person.

This search for answers, however well-informed, has always been approached in an ad hoc manner. Skills and prior experience make for better analysis, but queries are subject to the limitations of human knowledge and the frailties of prior beliefs. Looking for an answer in data starts with a hypothesis, but without further insight, the search becomes exploratory.

What this means for practical application is that we have only scratched the surface of data analysis. For each query that we ask there could be hundreds, thousands, or more that go unasked. As our wealth of data continues to grow, the rate at which we can ask questions remains stagnant while the total number of questions we might possibly ask grows exponentially.
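
The growth described above can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that each "question" is a test of whether one particular combination of recorded factors precedes an event; that framing and the function name `count_factor_combinations` are not from any specific tool. Under that assumption, the number of possible questions is the number of non-empty subsets of the factors.

```python
from math import comb
from typing import Optional


def count_factor_combinations(num_factors: int, max_size: Optional[int] = None) -> int:
    """Count non-empty combinations of factors, optionally capped at max_size members.

    Illustrative assumption: one candidate "question" per combination of factors.
    """
    if max_size is None:
        max_size = num_factors
    # Sum the number of ways to choose k factors, for k = 1 .. max_size.
    return sum(comb(num_factors, k) for k in range(1, max_size + 1))


if __name__ == "__main__":
    for n in (10, 20, 40):
        all_combos = count_factor_combinations(n)
        small_combos = count_factor_combinations(n, max_size=3)
        print(f"{n} factors: {all_combos} combinations, {small_combos} with at most 3 factors")
```

Even when an analyst only considers combinations of three factors or fewer, the count climbs quickly as new data sources add new factors, while the number of queries a person can write in a day stays roughly fixed.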

The Promise of Big Data

Among cigarette smokers, estimated rates of lung cancer death range from 8% to 23%. Simple analysis tells us that smoking cigarettes is a risky activity, but the promise of more data is that, beyond the rate of lung cancer death, we can learn what differentiates, at the low end of that range, the 8% from the remaining 92%. Genetic data, combined with the original smoker data, could reveal the genetic triggers that conspire with smoking to cause lung cancer death. Even then, we’re not likely to understand every factor that leads to a smoker’s death.

Coming to a reliable answer requires even more data. Not masses of data, but more diverse data. Environmental factors could be at play, like working conditions, sun or chemical exposure, rates of exercise, or levels of stress. Without taking into account this range of data, we simply don’t know.

The promise of Big Data is not that we can see, of all the smokers in the world, exactly what percentage of them will fall victim to lung cancer as a result of smoking. Big Data’s real promise is that across many different data sets, of all different sizes, we can reveal the multivariate factors that lead to death. This concept is not limited to health data; it must be applied to almost every kind of data.

The Internet of Everything is another example. It will lead to the creation of exponentially more data, but the only thing currently being discussed is the size and speed of the data. When every device in our homes, cars, and elsewhere is connected, we will have lost our ability to produce meaningful, impactful analysis by data autopsy. There will be too much health data and too much sensor data for the search model to continue.

In order to realize the true benefits of data we will have to adopt methods that give us a holistic view across data sets without querying data.

Our opportunity for learning more about how we live and work won’t be measured by the size of the data centers it’s stored in, but by the knowledge that’s created when data is combined and analyzed. Connecting all this data across multiple sources and then being able to extract real, meaningful information should be the focus of every organization looking to discover real benefit from data, regardless of scale. Only then will we be able to identify emerging problems as they happen and act on that information before an autopsy is necessary.