SmartData Collective

The Statistics of Everyday Talk

ThemosKalafatis
As discussed in the previous post, the analysis of free text on the Web (for example, the thoughts expressed by Twitter users) could yield very interesting insights into how users think and how they behave.

In 2001 I visited Trillium, where I attended a very useful seminar on Data Cleaning, Data Quality and Standardization, during which the Pareto principle became evident once again. When someone wishes to standardize entries in a database so that the word “Parkway” is written the same way across all records, they might find the following distribution of “parkway” entries:

15% of records contain the word “Parkway”
3% of records contain the word “Pkwy”
0.2% of records contain the word “Prkwy”
0.01% of records contain the word “Parkwy”

What that essentially means is that with a single SQL query one can find the 15% of records that use “Parkway” and convert them to whatever standardized form is needed. For the remaining variations, however, each additional query solves only a very small fraction of the problem, and this in turn increases the amount of work required, sometimes overwhelmingly.
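The diminishing return per query can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the `addresses` table, its sample rows, and the target form `PKWY` are all hypothetical and are not taken from the seminar material.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (id INTEGER PRIMARY KEY, street TEXT)")
conn.executemany(
    "INSERT INTO addresses (street) VALUES (?)",
    [
        ("12 Oak Parkway",), ("7 Elm Parkway",), ("4 Hill Parkway",),
        ("3 Pine Pkwy",), ("9 Lake Prkwy",), ("1 Bay Parkwy",),
        ("5 Main Street",),
    ],
)

STANDARD = "PKWY"  # assumed standardized form
VARIANTS = ["Parkway", "Pkwy", "Prkwy", "Parkwy"]  # most common first


def standardize(connection, variant, standard=STANDARD):
    """Run one UPDATE for one spelling variant; return the rows changed."""
    cursor = connection.execute(
        # instr() and replace() are both case-sensitive in SQLite, so rows
        # already converted to the standard form are not matched again.
        "UPDATE addresses SET street = replace(street, ?, ?) "
        "WHERE instr(street, ?) > 0",
        (variant, standard, variant),
    )
    return cursor.rowcount


# One query per variant: the first fixes the most rows, the rest very few.
fixed = {variant: standardize(conn, variant) for variant in VARIANTS}
rows = [r[0] for r in conn.execute("SELECT street FROM addresses ORDER BY id")]
```

Each rarer spelling still costs a full query (and the effort of discovering the spelling in the first place) while correcting fewer and fewer records.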

In capturing and analyzing natural language we are confronted with the same problem: 60% of people might express the fact that they don’t want to go to sleep with a simple “I don’t want to go to sleep”. But another 20% might use something like “i don’t feel like sleeping”, and another 10% something like “i don’t want to go to bed right now”.

So we immediately see one of the issues that text miners face: different phrases can communicate the same meaning. If we wish to analyze text for classification purposes (say, the sentiment of customers), we could achieve 60-65% accuracy in our results with some effort. For a mere 4-point increase in accuracy, from 65% to 69%, the extra effort required could prove prohibitive.

Consider the following chart:

These are all examples of phrases people use in their everyday talk. We can visualize such phrases as a tree that starts with “I don’t want to”, where each branch adds new meaning to the phrase. The branches marked with numbers are the parts of speech that give us an idea of what a person doesn’t want to do: to go, to feel, to visit, to know. Things get much more difficult in terms of the effort required if we wish to add more detail, and probably insight, to our analysis by moving further down the branches of the sentence tree.
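A sentence tree of this kind can be sketched as a simple word-level prefix tree. The code below is a minimal illustration, not the method used to produce the chart; the sample phrases and the `build_tree` helper are hypothetical.

```python
PREFIX = "i don't want to"


def new_node():
    """A tree node: how many phrases pass through it, plus its branches."""
    return {"count": 0, "children": {}}


def build_tree(phrases, prefix=PREFIX):
    """Build a word-level tree of everything that follows the prefix."""
    prefix_words = prefix.split()
    root = new_node()
    for phrase in phrases:
        # Normalize case and curly apostrophes before splitting into words.
        words = phrase.lower().replace("’", "'").split()
        if words[:len(prefix_words)] != prefix_words:
            continue  # phrase does not start with the prefix
        root["count"] += 1
        node = root
        for word in words[len(prefix_words):]:
            node = node["children"].setdefault(word, new_node())
            node["count"] += 1
    return root


# Hypothetical sample phrases, invented for illustration.
sample = [
    "I don't want to go to sleep",
    "I don’t want to go to bed right now",
    "I don't want to feel like this",
    "I don't want to know",
    "I want to go home",
]
tree = build_tree(sample)
first_level = {word: child["count"] for word, child in tree["children"].items()}
```

Here `first_level` holds the counts for the first branch level, while deeper levels (for example `tree["children"]["go"]["children"]`) split into ever smaller groups, which is where the extra effort comes from.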

Perhaps for marketers, the ability to quantify the distribution of words on the first level of the tree depicted above could be enough. Suppose we end up with the following word distribution:

To feel: 15%
To know: 7%
To go: 1%
To visit: 1%

Then we gain an insight into which words to use to market products more efficiently.
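A first-level distribution like the one above could be computed roughly as follows. This is a sketch under assumptions: the sample texts are invented, the regular expression only catches the literal wording “I don’t want to”, and shares are taken over all texts (which is one reading of why the percentages above do not sum to 100).

```python
import re
from collections import Counter

# Matches "I don't want to <word>" with a straight or curly apostrophe.
PATTERN = re.compile(r"\bi don['’]t want to (\w+)", re.IGNORECASE)


def first_level_distribution(texts):
    """Share of all texts whose first word after the phrase is each word."""
    counts = Counter()
    for text in texts:
        match = PATTERN.search(text)
        if match:
            counts[match.group(1).lower()] += 1
    total = len(texts)
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.most_common()}


# Hypothetical sample texts, invented for illustration.
texts = [
    "I don't want to feel sick again",
    "i don’t want to feel tired tomorrow",
    "Honestly I don't want to know",
    "I DON'T WANT TO go to work",
    "Great weather today",
]
distribution = first_level_distribution(texts)
```

Texts that phrase the same idea differently (“i don’t feel like sleeping”) are missed entirely by this pattern, which is exactly the long-tail problem described earlier.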

In the next post we will go through a hands-on example of analyzing the thoughts of Twitter users, and specifically what people say they “don’t want”.
