Big Data

February 7, 2010

Several months ago a short video appeared on YouTube with an interview of LinkedIn’s Chief Scientist DJ Patil. In it he discusses how ‘Big Data’ impacts the practice of analytics. I’ve only just got around to posting about it, but I am doing so now because he has some insights that I agree with and would like to share, as they are still relevant.

Today big data is most often associated with internet superstars like Google, eBay and Amazon. There are three other areas with lower profiles where big data is important: intelligence (spooks, the military, etc.), scientific and academic research, and the financial markets.

Big data’s future is much bigger than this because more and more areas of human activity are going to be faced with vast data sets. When you hear people talking about the growth of knowledge and statements like ‘if this data were printed then the stack would grow faster than NASA’s fastest rocket’, remember that there is a good chance each page of new data is adding to someone’s analytic data set.

      I’m not quoting the guy verbatim, but here’s what I heard and my takeaways from his comments:

      • Open source ‘big data ready’ technologies like Hadoop (see my earlier blog or here) have now come into their own. If you are facing big data challenges, look to people with these skills over those with only SQL skills.
      • We have reached a tipping point in the use of open source for commercial solutions to big data problems.
      • If you want good analysts, the best place to look is in occupations where people already have practical skills in manipulating big data sets: scientific fields like meteorology, oceanography and the like. I agree, but this is not the only place: in my experience I also need analysts who relate well to business decision makers, i.e. the people who make commercial decisions based on the analytics. This is perhaps less important in pure tech plays like LinkedIn.
      • Open source will transform the practice of analytics in the next 3 to 5 years. I think it will take longer than this to really affect the more traditional industries. I’m not happy about this, but I am realistic about the difficulty of convincing business leaders that open source is a superior solution to proprietary ones. The money behind the big vendors will keep them going for a number of years yet.

      One potential qualifier to DJ Patil’s perspective is that, although he has a very impressive big data background as a mathematician, US Department of Defense analyst (‘Threat Anticipation’), and former eBay Director of Strategy and Analytics, his current employer is LinkedIn.

      The core of LinkedIn’s big data is structured and fairly static: profiles of people. So I’m not sure how similar their big data challenges would be to, say, those faced when processing, understanding and predicting large streams of real-time data from financial markets or very large sensor arrays. On the other hand, the growth of LinkedIn communities and their related activities must generate large amounts of semi-structured data.

      I also have no idea what LinkedIn’s own analytic goals are beyond what DJ mentions on his own profile, where he says his analytics work drives product features like:

      • “People You May Know”
      • “Who Viewed My Profile”
      • “Groups You Might Like”

      Maybe somebody reading this blog knows more?

      The video is on YouTube and I embed it here for convenience:

      Or you can download ‘DJ Patil on How Big Data Impacts Analytics’ directly from this blog.
