The one topic that received significant coverage during the event was the emerging role of the ‘data scientist’ – a term apparently coined by Jeff Hammerbacher, the co-founder and Chief Scientist of Cloudera, while he was at Facebook. The McKinsey Global Institute recently published a study forecasting that the shortage of these skills could be as high as 190,000 people by 2018 in the US alone. The notion of such a discipline bothered me quite a bit, and until now I was not able to put my finger on why. In full disclosure, my educational background includes degrees in Mathematics, Computer Science and Operations Research, and I have spent most of my career helping companies deal with data and extract insights so they can make better decisions. But I am getting ahead of myself…
What is big data?
The definition of big data was, to my surprise, not a controversial topic. Most speakers agreed that big data is about both the quantity and the quality of the underlying data, i.e., volume measured in petabytes (10^15 bytes, i.e., 1M gigabytes, or more), and data that includes not only structured but also unstructured content (e.g., text, video, and social media). You can read Wikipedia’s definition here.
Incredible innovation at the data management layer
The field of big data has seen an explosion of a new alphabet soup over the past few years (ACID, Cassandra, Hadoop, HBase, Hive, NoSQL, MapReduce, Pig, and many more). Many early-stage (Cloudera, Kognitio, Netezza, ParAccel) and established (EMC through its acquisition of Greenplum, Microsoft through its acquisition of DATAllegro, Oracle, SAP, and Teradata through its acquisition of Aster Data) technology companies are innovating at an unprecedented pace to help their customers deal with the big data deluge.
While this innovation at the data management layer is significant, most discussions around the data scientist in the industry today are focused on the predictive analytics / data visualization level of extracting insights from big data, and this is where my fundamental disagreement lies:
Field is not new – Extracting insights from data (i.e., predictive analytics) gave birth to Operations Research as an inter-disciplinary field during World War II. The field has its roots in the 1840s, in the work Charles Babbage did to optimize the UK’s mail system. During WW II, UK and US scientists from many of the same disciplines people talk about today (mathematics, statistics, sociology and psychology) were brought together to help the Allied Forces optimize their artillery rounds and air / sea networks and decipher the German cryptographic codes. The field then branched out in the 1960s and 1970s into the telecom and airline industries and has since expanded across most of the business world. The fundamental mathematical techniques, however, have changed very little in the past 70 years.
We are all data scientists – Most of the innovation that is taking place at the data visualization layer today is about putting the information in the hands of those able to make the best decisions, i.e., the elusive business user / information worker. While this may feel self-serving, as it allows technology companies to expand their footprint, my many years of working as a ‘data scientist’ have led me to the very same conclusion:
- The real challenge is about driving adoption: Although this is more relevant in an enterprise context, the challenge is not about squeezing out the last drop of potential benefit, but rather about ensuring recommendations are adopted. If there is one thing my many years in the field have taught me, it is that the hard part is convincing the decision makers to adopt your ideas. This Microsoft Windows 7 commercial sums it up best.
- Back-office data geeks do not always know the business challenge: Having been one myself, I can attest to the fact that, despite how smart we think we are, the knowledge that comes from knowing your business while also being able to act on those insights is priceless. The image and title of this post refer exactly to this point. PG&E is my local energy utility company, and the data on the graph is my hourly energy consumption based on the smart meter (i.e., big) data they collect from my home. Who better to make decisions about energy consumption than the consumers themselves? Do the people appearing in this PG&E commercial look like data scientists to you?
What do you think? Is this short-sighted ‘old-world’ thinking, or the reality that will emerge over the next few years as we move past the hype?