The Fallacy of the Data Scientist Shortage
There is no question that the USA (in fact, most of the world) would be well-served with more quantitatively capable people to work in business and government. However, the current hysteria over the shortage of data scientists is overblown. To illustrate why, I am going to use an example from air travel.
On a recent trip from Santa Fe, NM to Phoenix, AZ, I tracked the various times:
Drive from Santa Fe to ABQ Airport
Wait to board
Wait for valet bag
Travel to rental car
Arrive at destination in Tempe
As you can see, the actual flying time of 60 minutes represents only 19% of the total travel time of 311 minutes (five hours and eleven minutes). Because everything but the actual flight time is more or less constant for any domestic trip (disregarding common delays, connections and cancellations, which would skew this analysis even further), this low percentage of time in the air is a reality. For example, if the flight took two hours and fifteen minutes, it would still work out to only 135/386 = 35%. The most recent data I have, from 2005, shows the average nonstop distance flown per departure was 607 miles, so we can add about 25 minutes to the first calculation and arrive at 85/336 = 25%.
Keep in mind, again, these calculations do not account for late departures/arrivals, cancelled and re-booked flights, connections, flight attendants and pilots having nervous breakdowns, etc. It’s safe to say that, on a typical flight, no more than about 25% of your travel time is spent in the air. Just for fun, let’s see how this would work out if we could take the (unfortunately retired) Concorde. Flying at about Mach 2, we would reduce our travel time by roughly 40 minutes, trimming our journey from five hours and eleven minutes to four hours and thirty-one minutes, about a 13% improvement.
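The arithmetic above can be sketched in a few lines of Python. This is only an illustration using the figures quoted in the text (a 311-minute trip, 60 minutes of it in the air, 40 minutes saved by the Concorde); the function and variable names are my own, and the ground time is assumed to stay constant regardless of flight length.

```python
# All times are in minutes and come from the trip described in the text.
TOTAL_TRIP = 5 * 60 + 11       # five hours and eleven minutes, door to door
FLIGHT = 60                    # actual flying time, Santa Fe to Phoenix
GROUND = TOTAL_TRIP - FLIGHT   # everything else, assumed constant (251)


def air_share(flight_minutes, ground_minutes=GROUND):
    """Fraction of total travel time that is spent in the air."""
    return flight_minutes / (flight_minutes + ground_minutes)


def trip_improvement(minutes_saved, total_minutes=TOTAL_TRIP):
    """Fractional reduction in door-to-door time from a faster flight."""
    return minutes_saved / total_minutes


scenarios = [
    ("actual trip", 60),
    ("average 607-mile flight", 85),
    ("2h15 flight", 135),
]
for label, flight in scenarios:
    total = flight + GROUND
    print(f"{label}: {flight}/{total} = {air_share(flight):.0%}")

print(f"Concorde, 40 minutes saved: {trip_improvement(40):.0%} shorter trip")
```

Even a large speed-up of the flight itself moves the total very little, because the ground time dominates.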
What’s the point of all of this and what does it have to do with the so-called data scientist shortage?
Based on our research at Constellation Research, we find that analysts who work with Hadoop or other big data technologies spend a significant amount of their time on tasks that do NOT require any knowledge of advanced quantitative methods: configuring and maintaining clusters; writing programs to gather, move, cleanse and otherwise organize data for analysis; and many other common tasks in data analysis. In fact, even those who employ advanced quantitative techniques spend 50-80% of their time gathering, cleansing and preparing data. This percentage has not budged in decades. Keep in mind that advanced analytics is not a new phenomenon; what is new is the volume (to some extent) and variety of the source data, along with new techniques to deal with it, especially, but not limited to, Hadoop.
That interest in analytics has risen dramatically in the past two or three years is not in dispute. But the adoption of enterprise-scale analytics with big data is not guaranteed in most organizations beyond some isolated areas of expertise. Most of the activity is in predictable (commercial) industries, such as net-based businesses, financial services and telecommunications, but these businesses have employed very large-scale analytics, at the bleeding edge of technology, for decades. For most organizations, analytics will be provided by algorithms embedded in applications not developed in-house, by third-party vendors of tools and services, and by consultants.
The good news is that 80% of the expertise you need for big data is readily available. The balance can be sourced and developed. The crème-de-la-crème of data scientists will fill roles in academia, technology vendors, Wall Street, research and government.
There are related and unrelated disciplines that are all combined under the term analytics. There is advanced analytics, descriptive analytics, predictive analytics and business analytics, all defined in a pretty murky way. The terminology cries out for some precision. Here is how I characterize the many types of analytics by the quantitative techniques used and the level of skill of the practitioners who use these techniques.
Type I: Quantitative Research (True Data Scientist)
Skill level: PhD or equivalent
Role: Creation of theory, development of algorithms. Academic/research. Often employed in business or government for very specialized roles.
Type II: (Current definition of) Data Scientist or Quantitative Analyst
Skill level: Advanced Math/Stat, not necessarily PhD
Role: Internal expert in statistical and mathematical modeling and development, with solid business domain knowledge.
Type III
Skill level: Good business domain, background in statistics optional
Role: Running and managing analytical models. Strong skills in and/or project management of analytical systems implementation.
Type IV: Business Intelligence/Discovery
Skill level: Data and numbers oriented, but no special advanced statistical skills
Role: Reporting, dashboard, OLAP and visualization use, possibly design. Performing posterior analysis of results driven by quantitative methods.
“Data Scientist” is a relatively new title for quantitatively adept people with accompanying business skills. The ability to formulate and apply tools to classification, prediction and even optimization, coupled with a fairly deep understanding of the business itself, is clearly in the realm of Type II efforts. However, it seems pretty likely that most so-called data scientists will lean more towards the quantitative and data-oriented subjects than business planning and strategy. The reason for this is that the term data scientist emerged from businesses like Google or Facebook where the data is the business, so understanding the data is equivalent to understanding the business. This is clearly not the case for most organizations. We see very few Type II data scientists with the kind of in-depth knowledge of the whole business that, say, actuaries in the insurance business have. Their extensive training should be a model for the newly designated data scientists (see my blogs “Who Needs Analytics PhD's? Grow Your Own” and “What is a Data Scientist and What Isn’t”).