Recently, I’ve been feeling like I’ve stepped through a looking glass into a similar-but-very-different world. I’ve spent 20+ years steeped in corporate data warehousing and business intelligence practice. Throughout that time, there have been technology improvements big and small, but nothing truly disruptive (although new analytic platforms are coming).
Meanwhile, a completely different thread of data analysis has emerged, with roots in open-source software (notably Hadoop) primarily designed for processing massive amounts of semi-structured web data. As the technology has advanced, it has had a growing impact on “traditional” data warehousing.
The people using these new technologies have formed their own vision of what an analyst’s role looks like, or as they call it, the role of a “data scientist”. DJ Patil and Jeff Hammerbacher coined the term a few years ago (there’s a nice graphical summary of the data scientist role by David Vellante), and DJ recently wrote an excellent piece defining the role.
He explains the skills a data scientist needs to be successful:
- Finding rich data sources.
- Working with large volumes of data despite hardware, software, and bandwidth constraints.
- Cleaning the data and making sure that data is consistent.
- Melding multiple datasets together.
- Visualizing that data.
- Building rich tooling that enables others to work with data effectively.
Reading the list, I couldn’t help but say to myself, “People have been doing this since computers were invented! What’s the big deal?” Ultimately, though, I’m excited about the new technology possibilities and the new point of view, and I’m looking forward to a synthesis of the best of the old and the new to get even more business value out of data.
But the booming interest in “data scientists” also worries me: the underlying premise is that (a) “advanced” analytics is what’s most important, and (b) analytics is the domain of “scientists”. The data scientist article focuses mostly on elite teams working on advanced, strategic problems. It defines a data science team as:
“a group that includes people working in design, web development, engineering, product marketing, and operations” that “delve into existing data sources and meld them with external data sources to understand the competitive landscape, prioritize strategy and tactics, and provide clarity about hypotheses that may arise during strategic planning.”
Over the years, we’ve made slow progress towards making everybody in the organization “responsible” for analysis, and it would be a shame if data scientists became the new high priests of knowledge. To get business value, number-crunching has to be combined with the knowledge spread through the company. I believe that it takes people to turn information into intelligence, and rather than focusing only on advanced analytics, we need to encourage all employees to be more data literate (see this example of what can go wrong) and encourage more shared analysis.
Luckily, it seems data scientists do indeed share these values:
“I’ve found that the strongest data-driven organizations all live by the motto “if you can’t measure it, you can’t fix it” (a motto I learned from one of the best operations people I’ve worked with). This mindset gives you a fantastic ability to deliver value to your company by:
- Instrumenting and collecting as much data as you can. Whether you’re doing business intelligence or building products, if you don’t collect the data, you can’t use it.
- Measuring in a proactive and timely way. Are your products and strategies succeeding? If you don’t measure the results, how do you know?
- Getting many people to look at data. Any problems that may be present will become obvious more quickly — “with enough eyes all bugs are shallow.”
- Fostering increased curiosity about why the data has changed or is not changing. In a data-driven organization, everyone is thinking about the data.”
“More sophisticated data-driven organizations thrive on the “democratization” of data. Data isn’t just the property of an analytics group or senior management. Everyone should have access to as much data as legally possible.”
These statements seem to be at odds with the whole notion of “data scientist” as an elite role, but maybe we’re “all data scientists now”?
Personally, I’m excited about the possibility of finding common ground, with new collaborative decision technologies such as SAP StreamWork that allow us not only to “get more people to look at data”, but also to share their different knowledge and points of view, align these with key business concerns, and learn from our past decision-making mistakes.