As discussed in the previous post, performing Text Analytics for a language for which no tools exist is not an easy task. The Case Study that I will present at the 9th European Text Analytics Summit is about analyzing and understanding thousands of non-English Facebook posts and Tweets about Telco Brands and their Topics, leading to what is known as Competitive Intelligence.
The Telcos used for the Case Study are Telenor, MT:S and VIP Mobile, all of which operate in Serbia. The analysis aims to identify how Customers perceive each of the three Companies and to understand the Positive and Negative elements of each Telco, as captured from the Voice of the Customer, that is, of its Subscribers.
By analyzing several thousand Tweets and Facebook posts and comments we can get a first glimpse of Competitive Intelligence. For example, when we look for the words that frequently co-occur with mentions of postpaid packages, this is what we find:
Red boxes show Telco Brands. Notice “mts” and “mtsa”, which point to the same Telco, namely MT:S. Blue boxes indicate similar words that should be merged. From a first look at the results above we see that:
a) MT:S is found more frequently than the other Brands when users mention PostPaid packages.
b) Telenor and VIP Mobile are not found as frequently as MT:S in PostPaid package conversations.
c) We see several problems caused by insufficient pre-processing: Kredit and Kredita (= credit) should merge into one word, and the same applies to telefon and telefona, internet and interneta, and mts and mtsa.
Notice that we can perform the same high-level analysis for several Telco Topics such as Network, Billing, Customer Care, Promotions, Questions of subscribers and so on.
The next task is to identify the reason(s) why MT:S was found to have more mentions about PostPaid packages. Note that at this point we do not know why this is so: it could be that MT:S prices of prepaid packages are high or very cheap, or that something else is happening that needs to be identified.
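The co-occurrence count behind the chart described above can be sketched in a few lines. This is only a minimal illustration in Python, not the tooling used for the Case Study: the sample posts are invented, and the function name `cooccurring_words` is hypothetical.

```python
import re
from collections import Counter

# Invented sample posts for illustration only; not the Case Study data.
posts = [
    "mts postpaid paket",
    "telenor postpaid internet",
    "mts postpaid kredit",
    "vip prepaid kredit",
]

def cooccurring_words(texts, keyword):
    """Count, per word, the number of posts that contain both it and `keyword`."""
    counts = Counter()
    for text in texts:
        # \w is Unicode-aware in Python 3, so letters such as š or č are kept.
        tokens = set(re.findall(r"\w+", text.lower()))
        if keyword in tokens:
            counts.update(tokens - {keyword})
    return counts

counts = cooccurring_words(posts, "postpaid")
for word, n in counts.most_common():
    print(word, n)
```

Counting each word once per post keeps a single long post from dominating the result. A count like this only shows which Brand appears most often next to a Topic, not why.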
The Serbian Language requires extra work because it is highly inflected: even the endings of Brand names change according to usage. Consider the following:
U mts-u (at mts)
Sa mts-om (With mts)
Bez mts-a (Without mts)
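One way to fold these forms back into a single Brand token is a regular expression that strips the case ending. The Python sketch below is an assumption for illustration: it covers only the three endings shown above (-u, -om, -a) plus the unhyphenated “mtsa” and the spelling “mt:s”, and the name `normalize_brand` is hypothetical. Real data would need a longer list of endings.

```python
import re

# Matches "mts" or "mt:s", optionally followed by one of the endings
# shown above, with or without a hyphen. The ending list is illustrative.
BRAND_RE = re.compile(r"\bmt:?s(?:-?(?:om|a|u))?\b", re.IGNORECASE)

def normalize_brand(text):
    """Replace every inflected form of the brand with the token 'mts'."""
    return BRAND_RE.sub("mts", text)

for phrase in ["U mts-u", "Sa mts-om", "Bez mts-a", "mtsa", "MT:S"]:
    print(phrase, "->", normalize_brand(phrase))
```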
A highly inflected language inflates our feature space, because every inflected form of a word is counted as a separate term. R can come to the rescue here with some success: we can use it to map several variants or synonyms to one word, remove (Serbian) stop words, remove URLs and perform several other pre-processing steps that are necessary prior to an extensive analysis. More in the next post.
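To make those pre-processing steps concrete, here is a minimal sketch written in Python rather than R, purely for illustration. The variant map reuses the word pairs mentioned earlier in this post. The stop-word list is a tiny hand-picked sample, not a complete Serbian list, and the names `VARIANTS`, `STOP_WORDS` and `preprocess` are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")

# Variant -> canonical form, taken from the pairs mentioned in this post.
VARIANTS = {
    "kredita": "kredit",
    "telefona": "telefon",
    "interneta": "internet",
    "mtsa": "mts",
}

# A small illustrative sample of Serbian stop words, far from complete.
STOP_WORDS = {"u", "sa", "bez", "je", "i", "na", "za"}

def preprocess(text):
    """Lowercase, drop URLs, tokenize, merge variants, remove stop words."""
    text = URL_RE.sub(" ", text.lower())
    tokens = re.findall(r"\w+", text)
    tokens = [VARIANTS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Kredita i interneta za mtsa http://example.com/promo"))
# prints ['kredit', 'internet', 'mts']
```

URLs are removed before tokenizing so that fragments such as “http” or “com” never enter the feature space.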