The Nature of Big Data and the Skills of Data Scientists
The job title "data scientist" was coined by DJ Patil and Jeff Hammerbacher when they were looking for a name for the people on their data teams who worked on big data. They did not want to limit those people's functions with an ill-fitting job title such as business analyst or research scientist (see Building Data Science Teams).
Ever since, the data scientist role has grown more popular as big data has become more critical to driving a successful business. However, some organizations still do not quite understand the roles that data scientists play or their responsibilities. In the same way, organizations sometimes do not know how to draw value from big data even though they are convinced there are nuggets hidden in it; their vision for using big data has blurred.
The nature of big data is defined by the three Vs: Volume, Variety and Velocity. The roles and responsibilities of data scientists follow naturally from that nature. Just as big data wears many hats, so does the data scientist who works on it: a data scientist has multiple roles and takes on multiple responsibilities in an organization.
- Expertise in Diverse Technologies
Tackling the sheer volume of big data requires a big data platform such as Apache Hadoop or LexisNexis HPCC. A data scientist should have a body of knowledge around such a platform in order to work with the data proficiently on it. A data scientist should:
1) Have a thorough understanding of the framework of a big data platform, such as a distributed file system (DFS) and the MapReduce programming framework, in order to deliver robust application designs. That means a data scientist should also know software architecture, components and design.
2) Be proficient in several of the programming languages supported by a big data platform, such as Java, Python, C++ or ECL.
3) Have a good understanding of database technologies, especially NoSQL databases such as HBase and CouchDB, because a big data platform usually communicates with databases to store a variety of data formats.
4) Have solid expertise in math/statistics, machine learning and data mining.
The success of a business is driven not by the amount of data it holds but by finding and extracting interesting, novel patterns and relationships in that data and using them to develop valuable products. Statistics, machine learning and data mining are the technologies used to understand data and dig out those nuggets, so a data scientist needs expertise in these fields to succeed. Skill with data mining tools or platforms such as R, Excel, SPSS and SAS is also critical; see Top Analytics and big data software tools.
5) Be good with Natural Language Processing (NLP) software or tools. Most of the content in big data is text: news, social media, reports, comments and so on. Knowing and mastering one or more NLP tools is critical to success as a data scientist.
6) Be skilled with one or more data visualization tools. Being able to use good visualization tools to present the patterns and relationships mined from big data is a definite plus for a data scientist. See the list of top 20 visualization tools.
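To make the MapReduce programming model mentioned in item 1 more concrete, here is a minimal single-machine sketch of its three steps (map, shuffle, reduce) applied to a word count. It is an illustration only: the function names and the sample documents are hypothetical, and it does not use the Hadoop or HPCC APIs, which distribute these same steps across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical sample input, standing in for files stored on a DFS.
sample_docs = ["big data needs big platforms", "data scientists mine data"]
counts = word_count(sample_docs)
print(counts["data"])  # 3
print(counts["big"])   # 2
```

On a real platform the map and reduce functions run in parallel on many nodes and the shuffle moves data between them over the network, which is why understanding the framework matters for designing robust applications.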
- Innovation and Curiosity
Because data changes so quickly, new findings and problems arise constantly. A data scientist should be sensitive to those changes, curious about new findings and creative in tackling new problems. He or she should also be eager to communicate them in a timely manner, explore new product ideas and solutions based on the new findings, and become a driver of product innovation.
- Business Skills
First, wearing multiple hats as a data scientist calls for strong communication skills. A data scientist has to communicate with diverse people in an organization: understanding business and application requirements, and interpreting the patterns and relationships mined from data for marketing groups, product development teams and corporate executives. Effective communication is the key to a business acting on new findings from big data in time. A data scientist should be a great collaborator and the link that connects all of these groups.
Second, a data scientist needs strong planning and organization skills in order to handle multiple tasks, set the right priorities and deliver on time.
Third, a data scientist should have the persuasive power, passion and storytelling skill to influence people to make the right decisions based on facts found in the data and to convince them of the value of new findings. In this sense, a data scientist is a leader who drives product innovation.
Overall, the nature of big data defines the skills of data scientists and their roles in an organization.
Ling has been with Comcast for about 2 years as a senior manager of advanced analytics, data scientist in the Enterprise Business Intelligence group. Prior to that, Ling was a lead data scientist in the advertising industry, where she developed an analytics road map and performed intensive analytics work to drive product improvement and innovation; she has also been a senior Data Scientist in a start-up ...