Social Media Analytics: Performance Measurement Done Right

By Philip Resnik, Ph.D., Lead Scientist at Converseon & ConveyAPI.

For a buyer of social media analytics, comparing the performance of various technologies is nothing short of baffling. This is especially true with respect to sentiment analysis — indeed text analytics in general — where scientific jargon, marketing puffery, and a laundry list of features can often obscure what really matters: using a technology meant to measure human expression, are we obtaining the value of a human analysis?

This notion of human performance as the ultimate goal is based on an important observation: when people analyze social media, we get valuable results.

When we built our social text analytics solutions, we recognized that, if only we could somehow take a few thousand people, shrink them and put them into a little box, and then get them to work thousands of times faster (to deal with seriously big data), we would have an incredible solution to our clients’ problems. Yes, people do make mistakes, and they disagree with each other about things. (Consider: “At this price point, I guess the smartphone meets the minimum requirements”. Three different people might fairly call this positive, negative, or neutral.) But even though human performance is imperfect, we know from long experience that human analysis provides all kinds of value that clients need.

So, when building and benchmarking our social media analysis technology, we set our sights on how close our system could get to human performance. One doesn’t need the technology to be 100% perfect, because people aren’t perfect, and we know people can get the job done just fine. (See the second paragraph again.) The right goal is for the technology to be as good as people.¹

With that in mind, here’s how we’re approaching the measurement challenge. The first step is to figure out how well people can do at the analysis we care about, so we know what we’re aiming for. How can you do that? Well, take someone’s analysis and have a second person judge it. Hmm. Wait a second. How do we judge whether the second person is a good judge? Add a third person to judge the second person. How do you now judge whether the third person is a good — Uh oh. You see the problem.

The problem is that there’s no ultimate, ideal judge at the end of the line. Nobody’s perfect. (But that’s OK, because we know that when people do the job, it delivers great value despite those imperfections. See that second paragraph yet again.) As it turns out, there’s a different solution: let your three people take turns judging each other. Here’s how it works. Treat Person 1’s analysis as “truth”, and see how Persons 2 and 3 do, that is, how often each of them agrees with it. Then treat Person 2’s analysis as “truth”, and see how Persons 1 and 3 do. Then treat Person 3’s analysis as “truth”, and see how Persons 1 and 2 do. It turns out that if we take turns allowing each person to define the “true” analysis for the others, and then average out the results, we’ll get a statistically reliable number for human performance, without ever having to pick any one of them as the person who holds the ultimate “truth”. We call that number the average human performance.²
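The turn-taking idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: we assume "how a person does" means the plain fraction of items on which two sets of labels match (the article does not name a specific metric), and the names `agreement`, `average_human_performance`, and the sample labels are all made up for the example.

```python
from itertools import permutations
from statistics import mean


def agreement(truth, judged):
    """Fraction of items where `judged` matches `truth` (assumed metric)."""
    if not truth or len(truth) != len(judged):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(t == j for t, j in zip(truth, judged)) / len(truth)


def average_human_performance(annotations):
    """Let each person take a turn as "truth" and average all the scores."""
    # permutations(..., 2) yields every ordered (truth, judged) pair,
    # so each person is scored against each of the others.
    pairs = permutations(annotations.values(), 2)
    return mean(agreement(truth, judged) for truth, judged in pairs)


# Hypothetical sentiment labels from three people on the same five posts.
annotations = {
    "person_1": ["pos", "neg", "neu", "pos", "neg"],
    "person_2": ["pos", "neg", "pos", "pos", "neg"],
    "person_3": ["pos", "neu", "neu", "pos", "neg"],
}

print(f"{average_human_performance(annotations):.2f}")  # prints 0.73
```

With plain agreement, the score is the same whichever person in a pair plays "truth", so averaging over ordered pairs gives the same number as averaging over unordered ones. A metric such as precision or F-measure for a single class would not have that symmetry, which is one reason to take turns explicitly.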

To find out whether our system is good, we compare how it does to average human performance. It’s the same turn-taking idea all over again, this time comparing system to humans rather than humans to humans. That is: treat Person 1’s analysis as “truth” and see how the system does. Do it again with Person 2 as “truth”. And Person 3. Average those three numbers, and we’ve got raw system performance.
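Scoring the system against each person in turn looks like this in Python. Again, this is a sketch rather than a definitive implementation: simple label agreement is our assumed metric, and `raw_system_performance`, `system_labels`, and the sample data are hypothetical.

```python
from statistics import mean


def agreement(truth, judged):
    """Fraction of items where `judged` matches `truth` (assumed metric)."""
    if not truth or len(truth) != len(judged):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(t == j for t, j in zip(truth, judged)) / len(truth)


def raw_system_performance(system_labels, annotations):
    """Score the system against each person as "truth", then average."""
    return mean(agreement(human, system_labels) for human in annotations.values())


# Hypothetical labels: three people and one system on the same five posts.
annotations = {
    "person_1": ["pos", "neg", "neu", "pos", "neg"],
    "person_2": ["pos", "neg", "pos", "pos", "neg"],
    "person_3": ["pos", "neu", "neu", "pos", "neg"],
}
system_labels = ["pos", "neg", "neu", "neg", "neg"]

print(f"{raw_system_performance(system_labels, annotations):.2f}")  # prints 0.67
```

The system is never used as "truth" here. It is only ever judged against the people, which keeps the number comparable to the human-versus-human average.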

The final step: what we really want to know is, how close is the raw system performance to average human performance? To find out, divide the former by the latter; the result is the percentage of human performance. For example, let’s suppose that the average human performance is 74%. That is, on average, humans agree with each other 74% of the time. (If that number seems low, yes, you guessed it; second paragraph.) Suppose Systems A and B turn in raw system performances of 69% and 59%, respectively. Is one system really better than the other? How can you tell? System A achieves 69/74 ≈ 93% of human performance. System B achieves 59/74 ≈ 80% of human performance. Out of all this number soup comes something that you can translate into understandable terms: System A is within spitting distance of human performance, but System B isn’t even within shouting distance. System A is better.³
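The arithmetic for Systems A and B is easy to check for yourself. The helper name `percent_of_human` below is ours, chosen for illustration; the figures are the ones from the example above.

```python
def percent_of_human(raw_system, average_human):
    """Raw system performance as a percentage of average human performance."""
    if average_human <= 0:
        raise ValueError("average human performance must be positive")
    return 100 * raw_system / average_human


# Figures from the example: humans agree 74% of the time on average.
print(f"System A: {percent_of_human(69, 74):.0f}%")  # prints System A: 93%
print(f"System B: {percent_of_human(59, 74):.0f}%")  # prints System B: 80%
```

Both inputs just need to be on the same scale (both percentages or both fractions), since only their ratio matters.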

What we’ve just described is a rigorous and transparent method for evaluating the performance of social analytics methods. When you’re evaluating technologies on your short list, we suggest you use this approach, too.

If you don’t have the resources for such a rigorous comparison, let us know, and we’ll lend you a hand.

¹ In a seminal paper about evaluation of language technology, Gale, Church, and Yarowsky established the idea of benchmarking systems against an upper bound defined by “the ability for human judges to agree with one another.” That’s been the standard in the field ever since. (William Gale, Kenneth Ward Church, and David Yarowsky. 1992. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics (ACL ’92). Association for Computational Linguistics, Stroudsburg, PA, USA, 249-256. DOI: 10.3115/981967.981999, http://dx.doi.org/10.3115/981967.981999.)

² This is an instance of a general statistical technique called cross-validation.

³ You’re about to ask how we decide that 93% is “spitting distance” and 80% isn’t, aren’t you? Fair enough. But we never said that the buyer’s judgment wasn’t going to be important. Our point is that you should be asking “93% of what?” and “80% of what?”, and the “what” should be defined in terms of the goal that matters to you. If what you’re after is human-quality analysis, then percentage of human performance is the right measure. Subjectively, we’ve found that if a system isn’t comfortably over 90% on this measure, it might be faster and more scalable, but it’s not providing the kind of quality that yields genuine insights for buyers.