Who knows what happiness lurks in the hearts of men? Facebook knows.

Have you heard about the Facebook Gross National Happiness Index? On Monday, October 12, the Times ran an article (by Noam Cohen) reporting some of the findings based on analysis of two years’ worth of Facebook status updates from 100 million users in the U.S. The index was created by Adam D. I. Kramer, a doctoral candidate in social psychology at the University of Oregon, and is based on counts of positive and negative words in status updates. According to the article, classification of words as positive or negative is based on the Linguistic Inquiry and Word Count dictionary.
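To make the idea of a word-count index concrete, here is a minimal sketch of how a daily score might be computed from status updates. It is an illustration only, not Mr. Kramer's actual method: the two word lists are tiny hypothetical stand-ins for the Linguistic Inquiry and Word Count dictionary, and the function name `daily_scores` and the scoring formula (share of positive words minus share of negative words) are my own assumptions.

```python
import re
from collections import defaultdict

# Hypothetical stand-in word lists; the real index relies on the LIWC dictionary.
POSITIVE = {"happy", "great", "love", "awesome"}
NEGATIVE = {"sad", "awful", "hate", "tragic"}


def daily_scores(updates):
    """Take (day, text) pairs and return {day: score}.

    Assumed score: (positive words - negative words) / total words for the day.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # positive, negative, total words
    for day, text in updates:
        words = re.findall(r"[a-z']+", text.lower())
        tally = counts[day]
        tally[0] += sum(w in POSITIVE for w in words)
        tally[1] += sum(w in NEGATIVE for w in words)
        tally[2] += len(words)
    # Skip days with no words at all to avoid dividing by zero.
    return {day: (p - n) / total for day, (p, n, total) in counts.items() if total}


if __name__ == "__main__":
    sample = [("Friday", "So happy, love this"), ("Monday", "Sad sad day")]
    print(daily_scores(sample))
```

Even in this toy form, the score says nothing about who posted or why, which is the limitation taken up below.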

Among the researchers’ conclusions: we’re happier on Fridays than on Mondays; holidays also make Americans happy. The premature death of a celebrity may make us sad. According to a post by Mr. Kramer on the Facebook blog, the two “saddest” days – days with the highest numbers of negative words – were the days on which actor Heath Ledger and pop icon Michael Jackson died. Mr. Kramer points out that, coincidentally, Mr. Ledger died on the day of the Asian stock market crash, which might have contributed to the degree of negativity.

We’re going to see a lot more of this kind of thing as researchers delve into the rich trove of information generated by users of search engines and web-enabled social networking. The happiness index, based as it is on simple frequency analysis of words, is the tip of the iceberg. At the moment, “social media” – I’m not exactly sure what that label means – is getting incredible attention in the marketing and marketing research community. The question that has yet to be posed, let alone answered, is, “what exactly do we learn from all this information?”

The Facebook Gross National Happiness Index is revealing. We can see a pattern in the data. Status updates contain more positive words on occasions when we might expect greater positive sentiment, and fewer positive words on days when we might expect greater negative sentiment. But anytime we look at spontaneously generated data, we need to ask “what’s missing?” User-generated content on the web is subject to coverage and selection biases which can sometimes be quite large. The happiness index provides an example. Is the pronounced negative sentiment on the day Heath Ledger died a function of the demographics of the Facebook community, or the subset who chose to update their status on that day, or some other unidentified subset of Facebook users?

In the “what’s old is new” category, a book published more than forty years ago can serve as a guide to using social media and user-generated content for research. Unobtrusive Measures: Nonreactive Research in the Social Sciences shed much light on the problems of traditional surveys, and offered a structured approach to using, in a sense, “found” data for social research. This book was out of print for a long time but happily has been reissued (at about 20 times the price I paid when I bought my copy as an undergraduate). While this book was out of print, Raymond M. Lee published Unobtrusive Measures in Social Research, which incorporated and updated the ideas in the original. Lee’s book includes sections on using the Internet for social science research. Either of these books should be required reading for anyone attempting to use social media and user-generated content for research purposes. The original Unobtrusive Measures exemplifies the best in social science writing, and is a pleasure to read.

The key to success with data generated from online activity, as with all research, is understanding the limitations in the data source and finding ways to compensate for those limitations. The authors of the original Unobtrusive Measures (Eugene J. Webb, Donald T. Campbell, Richard D. Schwartz, and Lee Sechrest) argue in favor of using multiple approaches. They view social science research as an “approximation” to knowledge, and the more points or manifestations we observe for each phenomenon of interest, the closer our approximation gets to some underlying “truth.”

The type of analysis reflected in the Facebook index has many limitations. Given the large number of individuals providing data, it’s tempting to discount potential biases due to coverage and selection. But a large sample does not remove those biases, and many individual-level causes get rolled up into the aggregate numbers of positive and negative words in the updates posted on any given day. We might assume that Facebook has the potential to disaggregate the data to a degree, and provide more insight into the factors that might drive swings in the happiness index. That’s something we should look forward to seeing from them.