The Pros and Cons of Collaborative Data Modeling

In the field of analytics – as in life – there are often multiple ways to come up with a solution to a problem. Since the types of business problems companies attempt to solve in today’s fast-paced and increasingly complex business environment are often multi-layered and difficult to crack, brainstorming can frequently deliver the best set of options for tackling even the most vexing issues.

Just as shrewd business leaders have come to rely on the collective intelligence and experience of their top lieutenants for effective decision making, so too are enterprise analytics teams increasingly relying upon collaborative approaches to problem solving.

In its Predicts 2012 research reports, Gartner says organizations will increasingly include the vast amounts of data from social networking sites in their decision-making processes. However, Gartner also says that over half of the investments companies make in analytics tools will be wasted because of cultural immaturity, a lack of required skills and inappropriate training levels.

In a Spotfire blog post from earlier this year, we also talked about the benefits of drawing upon the collective wisdom of a group by crowdsourcing analytics. One such forum is Kaggle, an online platform for predictive modeling competitions. Platforms such as Kaggle are making it possible for data scientists to come together on a wide variety of data modeling exercises.

As described on its web site, Kaggle offers companies a cost-effective way to harness the “cognitive surplus” of the world’s best data scientists. For instance, Kaggle recently fielded a competition with a prize pool of $10,000 for teams of data scientists to accurately predict market responses to large trades.

Nonetheless, collaborative data modeling can also be fraught with challenges, as noted in an article on the topic by Ventana Research Vice President and Research Director David Menninger (@dmenningervr). Some approaches to collaboration have centered on the use of social media tools. But as Menninger argues, while social media can be a vehicle for supporting conversations between people, data modeling is a considerably more complex exercise that requires workflow techniques and approval processes. These are important factors for decision makers to take into account.

Still, some online communities that have cropped up have shown promise for new approaches to collaborative data modeling. For example, Cross Validated is a free, community-driven Q&A forum for statisticians, data analysts, data miners, and data visualization experts.

Some straightforward programmer-type questions such as “Does anyone know a way to segment words into syllables using R?” are fairly easy to answer in a Q&A forum such as Cross Validated. But other questions, such as “What should k be in a k-fold cross validation?”, are likely to generate a variety of opinions because there isn’t necessarily a single valid answer. Under these circumstances, disagreements among community members are likely to break out over whether cross-validation works.
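To see why the question about k invites debate, it can help to look at what k actually controls. The following is a minimal sketch in Python, not a recommended procedure: the function names are hypothetical, the data is synthetic, and a predictor that always guesses the training-set mean stands in for a real model. It only illustrates that the same data can be scored with different values of k, each of which changes how much data is held out per fold.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k shuffled folds of near-equal size."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    return [idx[i::k] for i in range(k)]


def cv_mse_mean_predictor(y, k, seed=0):
    """Cross-validated mean squared error of a predictor that always
    guesses the training-set mean (an illustrative stand-in for a model)."""
    fold_errors = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [y[i] for i in range(len(y)) if i not in held]
        guess = mean(train)
        fold_errors.append(mean((y[i] - guess) ** 2 for i in held_out))
    return mean(fold_errors)


if __name__ == "__main__":
    rng = random.Random(42)
    y = [rng.gauss(10, 2) for _ in range(60)]  # synthetic data, assumed for illustration
    for k in (2, 5, 10, len(y)):  # k == len(y) is leave-one-out
        print(k, round(cv_mse_mean_predictor(y, k), 3))
```

Smaller values of k train on less data per fold, while larger values cost more computation, and k equal to the number of observations becomes leave-one-out. Which trade-off is right depends on the data set and the model, which is why a forum thread on the topic rarely settles on one number.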

Participants and visitors can view the hottest threads based on votes or views, such as a discussion of the best method to visualize a large interaction between two factors. Another popular thread asks participants to name the most famous statisticians and what it is that made them famous.

More of these types of communities will continue to emerge, creating additional opportunities for companies of all sizes to leverage the collective wisdom of the crowd. And while many of these sites aren’t perfect, they offer data scientists a terrific chance to connect with each other across all corners of the globe to brainstorm on approaches to tackling vexing problems.