The Statistics of Everyday Talk

January 19, 2009

As discussed in the previous post, analyzing free text on the Web (for example, the thoughts expressed by Twitter users) can yield very interesting insights into how users think and behave.

In 2001 I visited Trillium, where I attended a very useful seminar on Data Cleaning, Data Quality and Standardization, during which the Pareto principle became evident once again. Someone wishing to standardize entries in a database, so that the word “Parkway” is written the same way across all records, might find the following distribution of “parkway” entries:

15% of records contain the word “Parkway”
3% of records contain the word “Pkwy”
0.2% of records contain the word “Prkwy”
0.01% of records contain the word “Parkwy”

What that essentially means is that a single SQL query can find and correct 15% of the “parkway” variants, converting them to whatever standardized form is needed. For the remaining variations, however, each query solves only a very small fraction of the problem, which in turn increases the amount of work required, sometimes overwhelmingly.
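As a minimal sketch of this, consider a SQLite table of addresses (the table and column names here are invented for illustration). One UPDATE handles the most common variant in bulk, but every rarer misspelling needs its own query:

```python
import sqlite3

# Hypothetical example: a tiny address table with inconsistent "Parkway" spellings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (street TEXT)")
conn.executemany(
    "INSERT INTO addresses VALUES (?)",
    [("Lincoln Pkwy",), ("Ocean Parkway",), ("Grand Prkwy",), ("Eastern Pkwy",)],
)

# A single query standardizes the dominant variant...
conn.execute("UPDATE addresses SET street = REPLACE(street, 'Pkwy', 'Parkway')")
# ...but each rarer misspelling demands a query of its own, with diminishing returns.
conn.execute("UPDATE addresses SET street = REPLACE(street, 'Prkwy', 'Parkway')")

print([row[0] for row in conn.execute("SELECT street FROM addresses")])
# → ['Lincoln Parkway', 'Ocean Parkway', 'Grand Parkway', 'Eastern Parkway']
```

Each additional UPDATE fixes ever fewer records, which is exactly the Pareto effect described above.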

In capturing and analyzing natural language we are confronted with the same problem: 60% of people might describe not wanting to go to sleep with a simple “I don’t want to go to sleep”, but another 20% might use something like “I don’t feel like sleeping”, and another 10% something like “I don’t want to go to bed right now”.

So we immediately see one of the issues that text miners face: different phrases can communicate the same meaning. If we wish to analyze text for classification purposes (say, the sentiment of customers), we could achieve 60–65% accuracy in our results with some effort. But for a mere 4% increase in accuracy, from 65% to 69%, the amount of extra effort required could prove prohibitive.

Consider the following chart:

[Chart: a sentence tree of phrases beginning with “I don’t want to”, with numbered branches for “to go”, “to feel”, “to visit”, “to know”]

These are all examples of phrases people use in their everyday talk. We can visualize such phrases as starting with “I don’t want to”, with each branch adding new meaning to the phrase. The branches marked with numbers are the parts of speech that give us an idea of what a person doesn’t want to do: to go, to feel, to visit, to know. Things get much more difficult, in terms of the effort required, if we wish to add more detail (and probably insight) to our analysis by moving further down the branches of our sentence tree.

Perhaps for marketers, the ability to quantify the distribution of words at the first level of the tree depicted above is enough. Suppose we end up with the following word distribution:

To feel : 15%
To know : 7%
To go : 1%
To visit : 1%

Then we gain insight into which words to use to market products more efficiently.
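A minimal sketch of that first-level count (the sample phrases below are invented for illustration): capture the word that follows “I don’t want to” and tally its distribution:

```python
import re
from collections import Counter

# Invented sample phrases standing in for captured tweets.
phrases = [
    "I don't want to go to sleep",
    "i don't want to feel like this",
    "I don't want to know the ending",
    "i don't want to feel tired",
    "I don't want to visit the dentist",
]

# Capture the first word after "I don't want to", case-insensitively.
# (Real tweets would also need handling of curly apostrophes, typos, etc.)
pattern = re.compile(r"i don't want to (\w+)", re.IGNORECASE)
counts = Counter(m.group(1).lower() for p in phrases if (m := pattern.search(p)))

# Print the first-level distribution as percentages.
total = sum(counts.values())
for word, n in counts.most_common():
    print(f"To {word}: {100 * n / total:.0f}%")
```

On this toy sample the output starts with “To feel: 40%”, mirroring the kind of first-level distribution shown in the list above.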

In the next post we will go through a hands-on example of analyzing the thoughts of Twitter users, and specifically what people seem to “don’t want”.
