The Promise and Perils of Text Analytics — Privacy

October 27, 2009

Text analytics may be the next wave of computer analysis. In contrast to quantitative analytics, text analytics extends the mining of raw source data beyond numbers to include words, phrases and sentences, that is, alphabetic text.

The potential applications and benefits of text analytics are wide-ranging. They range from marketers detecting consumer trends more quickly, potentially in near real time, to applications that improve fraud detection. However, like any new digital frontier, text analytics brings both promises and perils.

A New York Times article, When 2+2 Equals a Privacy Question, revealed some of the potential risks to personal privacy. The article presents an example in which Netflix, the movie DVD rental company, provided researchers with a data set containing the movie preferences of 480,000 of its customers. The initial purpose was to seek improvements to its recommendation software. The customers’ identifying information had been removed from the data. However, researchers from the University of Texas were able to re-identify individual customers by correlating the presumably anonymous data with digital trails left on blogs, Twitter, Facebook, chat rooms, and cinema websites like IMDb.com. (A Netflix spokesperson disputed the findings by claiming the data sets had been altered, but that is a discussion for another time.)
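To make the idea of "correlating" concrete, here is a minimal sketch of how this kind of linkage can work in principle. It is not the researchers' actual method, and all names, IDs, titles and ratings are invented for illustration. The sketch assumes each anonymized record is a set of (movie, rating) pairs and that a person has posted some of the same ratings publicly under a real name. The hypothetical function `link` counts exact overlaps and accepts a match only when there is a single clear winner.

```python
from typing import Dict, Optional, Set, Tuple

Rating = Tuple[str, int]  # (movie title, star rating)

# Hypothetical "anonymized" release: names replaced by opaque IDs.
anonymized: Dict[str, Set[Rating]] = {
    "user_001": {("Movie A", 5), ("Movie B", 2), ("Movie C", 4)},
    "user_002": {("Movie A", 3), ("Movie D", 5)},
    "user_003": {("Movie B", 2), ("Movie E", 1), ("Movie F", 4)},
}

# Hypothetical public trail: ratings posted under a real name elsewhere.
public_reviews: Dict[str, Set[Rating]] = {
    "Alice Example": {("Movie A", 5), ("Movie C", 4)},
    "Bob Example": {("Movie B", 2), ("Movie E", 1), ("Movie F", 4)},
    "Carol Example": {("Movie G", 3)},
}


def link(
    public_ratings: Set[Rating],
    release: Dict[str, Set[Rating]],
    min_overlap: int = 2,
) -> Optional[str]:
    """Return the single ID sharing the most ratings, or None if no clear match."""
    # Score every anonymized record by how many ratings it shares exactly.
    scores = {uid: len(public_ratings & ratings) for uid, ratings in release.items()}
    best = max(scores.values(), default=0)
    if best < min_overlap:
        return None  # too little evidence to claim a match
    winners = [uid for uid, score in scores.items() if score == best]
    # Only accept an unambiguous winner; ties are treated as "not identified".
    return winners[0] if len(winners) == 1 else None


matches = {name: link(ratings, anonymized) for name, ratings in public_reviews.items()}
for name, uid in matches.items():
    print(f"{name}: {uid}")
```

Even in this toy version, two or three shared ratings are enough to single out a record. Removing names alone does not make a data set anonymous when the remaining fields are distinctive.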

When an example like this is applied to electronic medical or criminal records, the implications become more serious. Individuals may not want that kind of information revealed to motivated third parties, such as potential employers, who could cause them social, professional and financial damage. (As full disclosure, my employer SAS offers text analytics solutions, and I am impressed with the awareness and concern that my co-workers have about this topic.)

Re-identification of presumably de-identified data shifts the discussion of text analytics into the realm of privacy and ethics. It is my hope that society comes to grips with ways to manage these risks, because analytics can be so much fun. Examples abound, not only with marketers anticipating trends and improving targeted messages and offers to customers and prospects, but also with sports enthusiasts who seek to resolve debates about the “best” athletes or teams.