The Technology behind Social Media Analytics – An interview with Greg Greenstreet, CTO, SVP Engineering of Collective Intellect

February 2, 2011

Recently, I had the great opportunity to sit down with Greg Greenstreet, our CTO and SVP of Engineering here at Collective Intellect. So many of our recent blog posts have covered the uses of social media analytics, or trends and insights in Social CRM, that we thought it might be a good time to talk about the technology behind Collective Intellect. Greg was very patient with me as he described the differences between semantic analysis, Boolean search and natural language processing. We talked about why data accuracy is more important than data integrity and about the trends he sees in the future. The interview is divided into two parts: the first covers how our technology works, and the second is devoted to how the data is organized and configured to surface trends, themes, and audience traits and profiles.

What’s the biggest change you have seen in social media analytics and what are the different technologies being used to analyze social media conversations?

There are a couple of significant changes I’ve seen developing over the past 12 to 18 months. Today, in order to remain relevant and competitive, sophisticated and analytically savvy organizations must move beyond the awareness metrics provided by early monitoring and analytics technologies and pursue in-depth, contextually relevant information.

About a year ago, many companies that were just getting started used monitoring platforms that relied on basic keywords or Boolean term expressions, which were easy to use and implement. But they quickly learned that these types of tools have short-lived value, especially if the analysis involves ambiguous language. These solutions presume you know all the terms that might be used to refer to a specific topic. Take a term like “Crocs”, which can refer to the popular shoe or to the reptile: you’d have to continuously include or exclude content by hand, because keyword matching alone fails to disambiguate the meaning of terms.

Some monitoring tools use linguistic rules-based NLP techniques in a further attempt to disambiguate content. This approach can be costly, both in the time needed to develop the complex models involved and in the time it takes to process each textual item. It also requires additional linguistic rule sets any time the context of conversation shifts, making it difficult to apply to unstructured textual data sets like social media. Collective Intellect’s solution addresses the inaccuracy and bluntness of keyword search and the speed and cost disadvantages of NLP techniques through the use of advanced statistical language modeling.

Let’s talk about the technology CI uses in CI:Insight, our social media analytics tool. What makes it different from keyword search or NLP techniques?

CI’s semantic engine is based on latent semantic analysis (LSA), an advanced form of statistical language modeling. LSA is a method for exposing latent contextual meaning within a large body of text. It does this by looking at word usage (specifically, word co-occurrence) within a set of documents. Words that appear in similar contexts are assumed to have similar meaning and/or relational significance. LSA constructs a large matrix of term-document association data. Each cell in the matrix contains a weighted value, which is proportional to the number of times a term appears within a document in the set. The weights are structured so that rarer terms have greater weights. This lets the more relevant terms carry more weight, producing more accurate vectors of how consumers are talking about a category, brand or product. The technique deciphers the relationships and correlations between words and plots where they sit, dimensionally, in proximity to a specific topic of interest. LSA extracts specialized language features from a large data set and selects conversations based on their meaning. By isolating the contextual meaning of a topic, semantic filtering minimizes the miscategorizations (false positives) and inappropriate rejections (false negatives) that can otherwise occur with other techniques and technologies. The resulting categorization is more relevant and pertinent to a research query. LSA learns in much the way the human brain does, by recognizing the context of language from all of the previous times it has seen a term within that context. This produces a technology that can accurately disambiguate a term that is used in multiple contexts.
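To make the matrix idea concrete, here is a minimal sketch of the weighted term-document matrix that LSA starts from. It is an illustration under stated assumptions, not CI's implementation: the four-post corpus is invented, the rarity weighting is a plain inverse-document-frequency factor chosen for the example, and the helper names (`build_weighted_matrix`, `doc_vector`, `cosine`) are hypothetical. Full LSA goes on to apply a truncated singular value decomposition to this matrix; that step is left out here to keep the sketch dependency-free.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration: two posts about shoes, two about reptiles.
docs = [
    "crocs shoes comfy shoes",
    "crocs shoes socks winter",
    "crocs reptiles baby",
    "crocs reptiles rivers",
]

def build_weighted_matrix(docs):
    """Return (terms, matrix); matrix[i][j] is the weight of terms[i] in docs[j]."""
    tokenized = [doc.split() for doc in docs]
    terms = sorted({term for tokens in tokenized for term in tokens})
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    matrix = []
    for term in terms:
        # Rarer terms get a larger factor; a term found in every document gets 0.
        rarity = math.log(len(docs) / doc_freq[term])
        matrix.append([tokens.count(term) * rarity for tokens in tokenized])
    return terms, matrix

def doc_vector(matrix, j):
    """Column j of the matrix: one document expressed as term weights."""
    return [row[j] for row in matrix]

def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

terms, matrix = build_weighted_matrix(docs)
weights = dict(zip(terms, matrix))

print(weights["crocs"])  # all zeros: the ambiguous term alone carries no weight
print(cosine(doc_vector(matrix, 0), doc_vector(matrix, 1)))  # two shoe posts: above 0
print(cosine(doc_vector(matrix, 0), doc_vector(matrix, 2)))  # shoe vs. reptile post: 0.0
```

In this toy data the word "crocs" appears in every post, so it gets no weight at all, and the posts are told apart entirely by the words around it.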

Take a look at the image, which illustrates the volume of invalid data received when relying on keyword or Boolean search, as compared to semantic filtering, for the common term “Goldfish” as it relates to the brand of crackers:

Now imagine trying to write a Boolean keyword expression to capture conversations about a topic categorization like ‘Crocs’, the shoes; the expression quickly becomes unmanageable as negative terms are added in an attempt to exclude references to ‘croc’odiles. By using a semantic filter, CI’s Social CRM Insight solution isolates content in the shoes and sandals category and employs a simple keyword search – “Crocs” – to categorize content without having to worry about false positives occurring from crocodiles.

Semantic Search

“…Speaking of comfy, when is the Crocs craze going to end? It’s winter, and although it isn’t snowing everywhere, it’s snowing in Ohio. Why are people wearing Crocs with several pairs of socks in order to keep dry? I could understand if they were Louboutin’s, but honestly people.”

Keyword Search

“THESE baby crocs may look like cute pets – but beware. Measuring 30cm they will eventually reach three meters in length and could live to 80…”
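The contrast between the two posts above can be sketched in a few lines of code. This is a rough illustration, not CI's semantic filter: the `FOOTWEAR_CONTEXT` word set is a hypothetical, hand-written stand-in for a learned "shoes and sandals" category, the exclusion list is an assumed first attempt at a Boolean expression, and the posts are abridged.

```python
import re

# The two posts quoted above, abridged.
posts = {
    "shoes": "Speaking of comfy, when is the Crocs craze going to end? "
             "Why are people wearing Crocs with several pairs of socks in order to keep dry?",
    "reptiles": "THESE baby crocs may look like cute pets - but beware. "
                "Measuring 30cm they will eventually reach three meters in length",
}

def tokens(text):
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_match(text, keyword="crocs"):
    """Plain keyword search: the term is present, whatever it means."""
    return keyword in tokens(text)

def boolean_match(text, keyword="crocs", excluded=("reptile", "crocodile", "swamp")):
    """Boolean expression: keyword AND NOT any of the excluded terms."""
    words = tokens(text)
    return keyword in words and not words & set(excluded)

# Hypothetical context vocabulary standing in for a learned footwear category.
FOOTWEAR_CONTEXT = {"wearing", "socks", "comfy", "shoes", "sandals", "pairs"}

def category_then_keyword(text, keyword="crocs", min_overlap=2):
    """Isolate content in the category first, then apply the simple keyword."""
    words = tokens(text)
    in_category = len(words & FOOTWEAR_CONTEXT) >= min_overlap
    return in_category and keyword in words

for name, text in posts.items():
    print(name, keyword_match(text), boolean_match(text), category_then_keyword(text))
```

The reptile post slips past both the keyword and the Boolean expression, because none of the excluded words happens to appear in it. Catching it would mean adding yet another negative term, which is exactly how these expressions become unmanageable.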

Once we have an accurate and robust sample, what happens next? How does the technology optimize the data for analysis?

CI’s technology is used in a compounding fashion, starting with topic categorization, moving to theme extraction, and then to trait extraction. CI’s semantic search and analytics technology is unique in its proprietary approach to how data is handled, categorized and measured for relevancy. The proprietary technologies isolate important attributes from groups of authors and reveal unique considerations and preferences, and they also make it possible to identify unknown associations occurring in natural online conversation.

We’ve talked a little about topic categorization but what do you mean by theme or trait extraction?

Let’s take theme extraction first. Semantic analysis can be used to generate more meaningful themes associated with a topic. By coupling state-of-the-art clustering algorithms with semantic proximity measures, we derive themes by grouping semantically similar posts. This gives CI the ability to parse out the various conversations occurring within a topic.
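As a rough sketch of grouping similar posts into themes, the example below uses a simple greedy, threshold-based clustering over word-count vectors with cosine similarity. All of it is an assumption made for illustration: the posts are invented, raw word counts stand in for a real semantic proximity measure, and the greedy pass stands in for the clustering algorithms CI actually uses, which are not described here.

```python
import math
from collections import Counter

# Invented posts on the topic "iPhone".
posts = [
    "iphone battery drains fast",
    "battery life on my iphone drains quickly",
    "iphone camera takes great photos",
    "love the photos from the iphone camera",
]
TOPIC_TERMS = {"iphone"}                 # the topic itself should not become a theme
STOPWORDS = {"on", "my", "the", "from"}

def vectorize(text):
    """Word counts for a post, minus the topic term and stopwords."""
    words = [w for w in text.lower().split() if w not in TOPIC_TERMS | STOPWORDS]
    return Counter(words)

def cosine(u, v):
    """Cosine similarity between two Counters of word counts."""
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def cluster(posts, threshold=0.3):
    """Greedy single pass: join the first similar-enough cluster, else start a new one."""
    clusters = []  # each cluster: {"members": [post indexes], "centroid": summed counts}
    for i, text in enumerate(posts):
        vec = vectorize(text)
        for c in clusters:
            if cosine(vec, c["centroid"]) >= threshold:
                c["members"].append(i)
                c["centroid"].update(vec)
                break
        else:
            clusters.append({"members": [i], "centroid": Counter(vec)})
    return clusters

clusters = cluster(posts)
# Label each theme with the two most frequent words in its cluster.
themes = [[term for term, _ in c["centroid"].most_common(2)] for c in clusters]
print([c["members"] for c in clusters])
print(themes)
```

On this toy data the battery posts and the camera posts fall into separate clusters, and because the topic term is removed before vectorizing, "iphone" never shows up as a theme label.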

CI’s semantic filtering technology produces more meaningful themes than simple keyword term-occurrence techniques, such as the typical topic tag cloud, which produces a list of top terms ranked by simple counts only. These techniques do not use contextual relevancy and are therefore saddled with an inherent limitation: the inability to understand the conversations underlying a particular topic. For example, if the topic is iPhone, iPhone itself can be expected to emerge as the biggest “tag” (theme), which renders the tag meaningless, because a topic should not be a theme unto itself.

Using CI’s semantic themes, you can see true clusters of conversation based on meaning and then use those themes to create filters. As you create filters by accepting or rejecting themes, you can immediately test them to see whether they are targeting the exact content you want. You can continue to iterate, adding themes as accept or reject filters, until only the content you want is passing through. Once you have refined a set of filters that produces accurate data, you can apply them to larger datasets or use them to categorize content as a ‘topic’ in CI’s continual stream of comprehensive social media data.
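The accept/reject loop described above might look something like the following sketch. The `ThemeFilter` class, the `apply_filter` helper, the theme names and the sample posts are all hypothetical, invented to show the iteration; they are not part of CI's product or API.

```python
from dataclasses import dataclass, field

@dataclass
class ThemeFilter:
    """Hypothetical filter: a set of themes to accept and a set to reject."""
    accept: set = field(default_factory=set)
    reject: set = field(default_factory=set)

    def passes(self, post_themes):
        themes = set(post_themes)
        if themes & self.reject:
            return False
        # With no accept themes configured, anything not rejected passes.
        return not self.accept or bool(themes & self.accept)

def apply_filter(theme_filter, posts):
    """Return the text of every post that passes the filter."""
    return [text for text, themes in posts if theme_filter.passes(themes)]

# Invented posts, each already tagged with the themes it belongs to.
posts = [
    ("Crocs with socks in winter", {"footwear", "winter"}),
    ("baby crocs at the zoo", {"reptiles", "zoo"}),
    ("boots made from croc skin", {"footwear", "reptiles"}),
]

# First pass: accept the footwear theme and test the result.
first_pass = ThemeFilter(accept={"footwear"})
print(apply_filter(first_pass, posts))
# ['Crocs with socks in winter', 'boots made from croc skin']

# The test shows unwanted content, so add a reject theme and test again.
refined = ThemeFilter(accept={"footwear"}, reject={"reptiles"})
print(apply_filter(refined, posts))
# ['Crocs with socks in winter']
```

Each pass is cheap to test, so the filter set can be refined on a small sample before it is applied to a larger dataset.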

Next week, we’ll dig deeper into trait extraction, customer preferences, profiles and insights and the future of integrated social media analytics.