Analytics

Understanding and Analyzing the Hidden Structures of an Unstructured Data Set

kunalj101
Last updated: 2014/08/28 at 5:41 PM
9 Min Read

The key to using an unstructured data set is to identify the hidden structures within it.

Contents
  • Business Problem
  • Understanding the dataset
  • Cleaning the dataset
  • End Notes

This enables us to convert the data to a structured, more usable format. In the previous article (previous article on text mining), we discussed the framework for using unstructured data sets in predictive or descriptive modelling. In this article, we will go into more detail on understanding the data structure and cleaning unstructured text to make it usable for the modelling exercise. We will use the same business problem discussed in the last article to walk through these procedures.

Business Problem

You are the owner of Metrro Cash n Carry. Metrro has a tie-up with Barcllays bank to launch co-branded cards, and the two companies have recently entered into an agreement to share transaction data: Barcllays will share all transactions made on its credit cards at any retail store, and Metrro will share all transactions made by any credit card at its stores. You wish to use this data to track where your high-value customers shop other than at Metrro.

To do this, you need to extract information from the free transaction text available in the Barcllays transaction data. For instance, a transaction with the free text "Payment made to Messy" should be tagged as a transaction made at the retail store "Messy". Once we have the store tags and the frequency of transactions at these stores for Metrro's high-value customers, we can analyze the reasons for this customer outflow by comparing services between Metrro and the other retail stores.
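
To make the tagging step concrete, here is a minimal R sketch (not from the original article; the store list and transaction strings are illustrative) that tags each transaction with the first store name it mentions. In practice you would run this after the cleaning steps described later, so that case and punctuation do not break the matches.

> # hypothetical dictionary of store names we want to detect
> stores <- c("messy", "big bazaar")
> txns <- tolower(c("Paymt made to Messy 230023929#21 Barcllay",
+                   "Transactn made to Big Bazaar 42323#2322 Barcllay"))
> # tag each transaction with the first dictionary entry found in its text
> sapply(txns, function(t) stores[sapply(stores, grepl, x = t)][1])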

Understanding the dataset

Let us first look at the raw data to build a framework for data cleaning. Following are the sample transactions we need to work on:

  1. Paymt made to : Messy 230023929#21 Barcllay
  2. Transactn made to : Big Bazaar 42323#2322 Barcllay
  3. Pay to messy : 342343#2434 Barcllay
  4. messy bill pay : 32344#24324 Barcllay

Let us observe the data carefully to understand what information can be derived from this data set.

  1. An action word like "payment", "Paymt", or "Transactn" is present in every transaction. It is possible that the words "pay" and "transact" refer to different modes of payment, such as credit card payment or cash card payment.
  2. The word at the end of every transaction is common. This should be the name of the card used.
  3. Every transaction has the name of a vendor. However, this name appears in both lower and upper case.
  4. There is a number code in every transaction. We can either ignore this code or derive very meaningful information from it. The code could encode the area where the store is located, some combination with the date of purchase, or the customer code. If we are able to decode these numbers, we can possibly get to the next level of analysis. For instance, if we can find the area of the transaction, we can do an area-level analysis. Or these codes might correspond to product families, and hence could be used to optimize our services. (A sketch for extracting these codes follows this list.)
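
If you do decide to explore these codes, a single regular expression is enough to pull them out for later decoding. Here is a minimal sketch (it assumes the codes always follow the digits#digits shape seen above):

> txns <- c("Paymt made to Messy 230023929#21 Barcllay",
+           "Pay to messy 342343#2434 Barcllay")
> # extract the digits#digits code from each transaction
> regmatches(txns, regexpr("[0-9]+#[0-9]+", txns))   # "230023929#21" "342343#2434"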

Cleaning the dataset

Cleaning text data in R is extremely easy. For this analysis we will not use the numbers at the end of the transactions, but if you want to build a stronger analysis, this is something you should definitely explore. In this data set, we need to make the following adjustments:

  1. Remove the numbers
  2. Remove the special character "#"
  3. Remove common words like "to", "is", etc.
  4. Remove the common term "Barcllay" from the end of every sentence
  5. Remove punctuation marks

Given our understanding of the data, steps 2 & 5 and steps 3 & 4 can be combined to avoid extra effort. In step 2, we simply need to remove the single character "#", which R removes automatically along with the other punctuation. We will combine the words in steps 3 and 4 and remove them together. Once we have the clean data set, we will convert it into a term-document matrix. You can use the following code for this exercise:

 

> library(tm)
> # 'a' holds the four raw transaction strings listed above
> a <- c("Paymt made to Messy 230023929#21 Barcllay", "Transactn made to Big Bazaar 42323#2322 Barcllay",
+        "Pay to messy 342343#2434 Barcllay", "messy bill pay 32344#24324 Barcllay")
> myCorpus <- Corpus(VectorSource(a))
> inspect(myCorpus)
 

<<VCorpus (documents: 4, metadata (corpus/indexed): 0/0)>>

[[1]] <<PlainTextDocument (metadata: 7)>>

Paymt made to Messy 230023929#21 Barcllay

[[2]] <<PlainTextDocument (metadata: 7)>>

Transactn made to Big Bazaar 42323#2322 Barcllay

[[3]] <<PlainTextDocument (metadata: 7)>>

Pay to messy 342343#2434 Barcllay

[[4]] <<PlainTextDocument (metadata: 7)>>

messy bill pay 32344#24324 Barcllay


> # remove punctuation

> myCorpus <- tm_map(myCorpus, removePunctuation)

> inspect(myCorpus)


<<VCorpus (documents: 4, metadata (corpus/indexed): 0/0)>>

[[1]] <<PlainTextDocument (metadata: 7)>>

Paymt made to Messy 23002392921 Barcllay

[[2]] <<PlainTextDocument (metadata: 7)>>

Transactn made to Big Bazaar 423232322 Barcllay

[[3]] <<PlainTextDocument (metadata: 7)>>

Pay to messy 3423432434 Barcllay

[[4]] <<PlainTextDocument (metadata: 7)>>

messy bill pay 3234424324 Barcllay


> # remove numbers

> myCorpus <- tm_map(myCorpus, removeNumbers)

> inspect(myCorpus)


<<VCorpus (documents: 4, metadata (corpus/indexed): 0/0)>>

[[1]] <<PlainTextDocument (metadata: 7)>>

Paymt made to Messy Barcllay

[[2]] <<PlainTextDocument (metadata: 7)>>

Transactn made to Big Bazaar Barcllay

[[3]] <<PlainTextDocument (metadata: 7)>>

Pay to messy Barcllay

[[4]] <<PlainTextDocument (metadata: 7)>>

messy bill pay Barcllay


> # remove stopwords

> # Add required words to the list

> myStopwords <- c(stopwords("english"), "Barcllay")

> myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

> inspect(myCorpus)


<<VCorpus (documents: 4, metadata (corpus/indexed): 0/0)>>

[[1]] <<PlainTextDocument (metadata: 7)>>

Paymt made Messy

[[2]] <<PlainTextDocument (metadata: 7)>>

Transactn made Big Bazaar

[[3]] <<PlainTextDocument (metadata: 7)>>

Pay messy

[[4]] <<PlainTextDocument (metadata: 7)>>

messy bill pay


> myDtm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))

> inspect(myDtm)

 

<<TermDocumentMatrix (terms: 8, documents: 4)>>

Non-/sparse entries: 12/20

Sparsity            : 62%
Maximal term length : 9
Weighting           : term frequency (tf)


          Docs
Terms     1 2 3 4
bazaar    0 1 0 0
big       0 1 0 0
bill      0 0 0 1
made      1 1 0 0
messy     1 0 1 1
pay       0 0 1 1
paymt     1 0 0 0
transactn 0 1 0 0
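
With the term-document matrix in hand, we can already answer a first version of the business question: which vendor terms occur most often across transactions? The following short sketch uses standard base-R operations on the matrix above (it is not part of the original code):

> # sum each term's occurrences across all four documents
> m <- as.matrix(myDtm)
> sort(rowSums(m), decreasing = TRUE)   # "messy" tops the list with 3 of the 4 transactions

Counts like these, aggregated over Metrro's high-value customers, are exactly the store-level frequencies the business problem asks for.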

End Notes

Cleaning data sets is a crucial step in any kind of data mining, and it is many times more important when dealing with unstructured data sets. Understanding and cleaning the data consume the largest share of time in any text mining analysis. In the next article we will talk about creating a dictionary manually. This becomes important when we are doing a niche analysis for which a ready-made dictionary is either not available or very expensive.
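
As a preview of that next step, a manual dictionary can start as a simple named vector mapping free-text variants to canonical store names; the entries below are purely illustrative:

> # hypothetical hand-built dictionary: free-text variant -> canonical store name
> store_dict <- c("messy" = "Messy", "big bazaar" = "Big Bazaar",
+                 "bigbazaar" = "Big Bazaar")   # several variants can map to one store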

Have you done text mining before? If so, what other cleaning steps did you use? What tool do you think is most suitable for a niche kind of text mining such as transaction analysis or behavioral analysis? Did you find the article useful? Did this article resolve any of your existing dilemmas?

TAGGED: unstructured data