An Introduction To Hands-On Text Analytics In Python

Ashish Kumar

Python is a high-level, object-oriented programming language. Here is a quick, hands-on tutorial on using it for text analytics with the NLTK library.

Contents
  • Setting up NLTK
  • Reading a text file
  • Tokenisation
  • Dispersion plots
  • Converting tokens to NLTK text
  • Collocations
  • Word at a particular position
  • Position of a particular word
  • Concordance
  • Lemmatization
  • Text Cleaning
  • TF-IDF

Python enables four kinds of text analytics:

  1. Text matching
  2. Text classification
  3. Topic modelling
  4. Summarization

Let’s begin by looking at how NLTK is set up and how to read the text file used in the examples. The tutorial then covers the basics of NLP:

Basics of NLP

  • Tokenisation
  • Stemming & Lemmatization
  • Dispersion Plots
  • Word frequency

Setting up NLTK

import nltk

nltk.download()  # opens the NLTK downloader; fetch the book collection before the next line

from nltk.book import *

Reading a text file

import os

os.chdir('F:/Work/Philip Adams/Course Content/Data')

# In Python 3, pass the encoding to open(); the file is decoded to Unicode as it is read
f=open('Genesis.txt', encoding='utf8').read()

  • Our programs will often need to deal with different languages and different character sets. The concept of “plain text” is a fiction.
  • ASCII covers only 128 characters, essentially unaccented English letters, digits and punctuation.
  • Unicode is used to process non-ASCII characters.
  • Unicode supports over a million characters. Each character is assigned a number, called a code point.
  • Translating bytes into Unicode is called decoding.
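
The decoding step can be seen directly in Python 3. This is a minimal sketch, assuming a UTF-8 encoded byte string; the sample word is illustrative and does not come from the Genesis file.

```python
# Bytes as they would be stored in a UTF-8 encoded file (sample text is illustrative)
raw = "café".encode("utf8")

# Decoding translates the raw bytes into a Unicode string
text = raw.decode("utf8")

# Each character maps to a numeric code point
code_points = [ord(ch) for ch in text]

print(len(raw), len(text))  # the accented letter takes two bytes but is one character
print(code_points)
```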

Let’s move a step deeper and understand the four basics of NLP in detail:

Tokenisation

  • Breaking up the text into words and punctuation marks
  • Each distinct word or punctuation mark is a token

line='Because he was so small, Stuart was often hard to find around the house. - E.B. White'

# The tokens are the words plus the punctuation marks:
# Because, he, was, so, small, Stuart, was, often, hard, to, find, around, the, house, ',', E, B, White, '.', '-'

tokens=nltk.word_tokenize(f)

len(tokens)

tokens[:10]
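
To see what tokenisation does without installing anything, here is a rough regular-expression sketch. It is only an approximation and will not match the output of nltk.word_tokenize in every case; the function name simple_tokenize is hypothetical.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

line = "Because he was so small, Stuart was often hard to find around the house."
tokens = simple_tokenize(line)

print(tokens[:6])
print(len(tokens))
```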

Dispersion plots

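A dispersion plot shows the positions at which a word occurs across the document or text corpus. It is called on an NLTK Text object (see “Converting tokens to NLTK text”).

text.dispersion_plot(['God','life','earth','empty'])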
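
The data behind a dispersion plot is simply the list of token offsets at which each word occurs. Here is a minimal sketch in plain Python, using a short made-up token list rather than Genesis; word_offsets is a hypothetical helper, not an NLTK function.

```python
def word_offsets(tokens, targets):
    # Map each target word to the positions (offsets) where it appears
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

sample_tokens = "In the beginning God created the heavens and the earth".split()
offsets = word_offsets(sample_tokens, ["God", "earth", "the"])
print(offsets)
```

Plotting each list of offsets on its own horizontal line gives the same kind of picture as the NLTK plot.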


Converting tokens to NLTK text

To apply NLTK processes, the tokens need to be converted to an NLTK Text object.

text=nltk.Text(tokens)

Collocations

Words frequently occurring together

text.collocations()

one hundred; years old; Paddan Aram; young lady; seven years; little ones; found favor; burnt offering; living creature; every animal; four hundred; every living; thirty years; Yahweh God; n’t know; nine hundred; savory food; taken away; God said; ‘You shall
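
Conceptually, collocation finding starts from counting adjacent word pairs. The sketch below counts raw bigram frequencies only; NLTK's collocations() also scores and filters the pairs, so its results will differ. The sample sentence is made up and bigram_counts is a hypothetical helper.

```python
from collections import Counter

def bigram_counts(tokens):
    # Pair each token with its successor and count how often each pair occurs
    return Counter(zip(tokens, tokens[1:]))

sample_tokens = "one hundred years and one hundred days".split()
counts = bigram_counts(sample_tokens)
print(counts.most_common(1))
```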

Word at a particular position

text[225]

Position of a particular word

text.index('life')

Concordance

Finding the context of a particular word in the document

text.concordance('life')
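
The idea behind a concordance can be sketched in a few lines of plain Python: find each occurrence of the word and keep a few tokens on either side. This is an illustration only, not NLTK's implementation; simple_concordance and the sample tokens are hypothetical.

```python
def simple_concordance(tokens, word, width=3):
    # Collect each occurrence of `word` with `width` tokens of context on both sides
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines

sample_tokens = "the breath of life and the tree of life also".split()
context_lines = simple_concordance(sample_tokens, "life")
for context in context_lines:
    print(context)
```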

  • Total number of words in a document

len(tokens)

  • Total number of distinct words in a document

len(set(tokens))

  • Diversity of words or percent of distinct words in the document

len(set(tokens))/len(tokens)

  • Percentage of text occupied by one word

100*text.count('life')/len(tokens)
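
The counts above can be wrapped into one small helper. This is a sketch, assuming Python 3 (where / is true division rather than integer division); lexical_stats is a hypothetical name and the token list is made up.

```python
def lexical_stats(tokens, word):
    total = len(tokens)                       # total number of tokens
    distinct = len(set(tokens))               # number of distinct tokens
    diversity = distinct / total              # share of distinct tokens
    share = 100 * tokens.count(word) / total  # percent of the text taken by `word`
    return total, distinct, diversity, share

sample_tokens = ["life", "is", "life", "and", "more"]
stats = lexical_stats(sample_tokens, "life")
print(stats)
```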

  • Frequency distribution of words in a document

from nltk.probability import FreqDist

fdist=FreqDist(tokens)

  • Function to return the frequency of a particular word

from nltk.probability import FreqDist

def freq_calc(word,tokens):
    # Build the frequency distribution, then look up the word's count
    fdist=FreqDist(tokens)
    return fdist[word]

  • Most frequent words

fdist.most_common(50)

  • Other frequency distribution functions

fdist.max(), fdist.plot(), fdist.tabulate()
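
For quick counts without NLTK, Python's collections.Counter offers the same lookup and most_common calls. A minimal sketch with a made-up token list:

```python
from collections import Counter

sample_tokens = "the earth was empty and the earth was dark".split()
word_counts = Counter(sample_tokens)

print(word_counts["earth"])        # frequency of one word
print(word_counts.most_common(3))  # the three most frequent words with their counts
```

Counter does not provide the plot() and tabulate() helpers, so FreqDist remains the more convenient choice inside an NLTK workflow.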

  • Counting the word length for all the words

[len(w) for w in text]

  • Frequency distribution of word lengths

fdistn=FreqDist([len(w) for w in text])

fdistn

  • Returning words longer than 10 letters

[w for w in tokens if len(w)>10]

  • Stop words

Very common words, such as ‘the’, ‘is’ and ‘and’, which carry little contextual meaning

from nltk.corpus import stopwords

stop_words=set(stopwords.words('english'))

  • Filtering stop words

# NLTK's stop words are lower case, so compare in lower case
filtered=[w for w in tokens if w.lower() not in stop_words]

filtered
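
The same filtering pattern works with any stop-word list. The sketch below uses a tiny hand-written set purely for illustration (NLTK's English list is much longer) and a hypothetical helper named remove_stop_words.

```python
# A tiny illustrative stop-word list; NLTK's English list is much longer
stop_words = {"the", "was", "and", "of", "a", "in"}

def remove_stop_words(tokens):
    # Compare in lower case so that "The" is filtered as well as "the"
    return [t for t in tokens if t.lower() not in stop_words]

sample_tokens = "The earth was empty and darkness covered the deep".split()
kept = remove_stop_words(sample_tokens)
print(kept)
```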

  • Stemming

Keeping only the root/stem of each word, so that its derived forms reduce to the same root

For example, ‘walked’ and ‘walking’ both reduce to the stem ‘walk’

from nltk.stem import PorterStemmer

ps=PorterStemmer()

for w in tokens:
    print(ps.stem(w))

Lemmatization

Similar to stemming but more robust, as it returns dictionary words and can distinguish between words based on their part of speech

For example, ‘wolves’ becomes ‘wolf’, and ‘are’ becomes ‘be’ when it is tagged as a verb

from nltk.stem import WordNetLemmatizer

lm=WordNetLemmatizer()

for w in tokens:
    print(lm.lemmatize(w))

lm.lemmatize('wolves')

Result: 'wolf'

lm.lemmatize('are',pos='v')

Result: 'be'

  • POS (Part of Speech) Tagging

Tagging each token/word as a part of speech

nltk.pos_tag(tokens)


  • Regular Expressions

Expressions that denote patterns matching words, phrases or sentences in a text/document

  • re.search

matchObject = re.search(pattern, input_str, flags=0)

Stops after the first match

import re

regex=r"(\d+)"

match=re.search(regex,"91,'Alexander','Abc123'")

match.group(0)

Result: '91'

  • re.findall

matches = re.findall(pattern, input_str, flags=0)

Returns all non-overlapping matches as a list, so there is no group() call

import re

regex=r"(\d+)"

matches=re.findall(regex,"91,'Alexander','Abc123'")

matches

Result: ['91', '123']

  • re.sub

replacedString = re.sub(pattern, replacement_pattern, input_str, count=0, flags=0)

import re

regex=r"(\d+)"

re.sub(regex,"","91,'Alexander','Abc123'")

Result: ",'Alexander','Abc'"

Text Cleaning

Removing a list of words from the text

noise_list = ["is", "a", "this", "…"]

def remove_noise(input_text):
    words = input_text.split()
    noise_free_words = [word for word in words if word not in noise_list]
    noise_free_text = " ".join(noise_free_words)
    return noise_free_text

remove_noise("this is a sample text")

Replacing a set of words with standard terms

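# Example mapping (assumed for illustration; extend it with your own terms)
lookup_dict = {'rt':'Retweet', 'dm':'direct message', 'awsm':'awesome', 'luv':'love'}

input_text="This rt is actually an awsm dm which I luv"

words = input_text.split()

new_words = []

for word in words:
    # Replace the word if its lower-case form is in the lookup dictionary
    if word.lower() in lookup_dict:
        word = lookup_dict[word.lower()]
    new_words.append(word)

new_text=" ".join(new_words)

new_text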

N-Grams

An n-gram is a sequence of n consecutive words.

def generate_ngrams(text, n):
    words = text.split()
    output = []
    # Slide a window of n words across the text
    for i in range(len(words)-n+1):
        output.append(words[i:i+n])
    return output

generate_ngrams("Virat may break all the records of Sachin",3)

TF-IDF

Term Frequency – Inverse Document Frequency

Converts the text documents into vector models on the basis of the occurrence of words in the documents

  • Term Frequency (TF): frequency of a term T in a document D
  • Inverse Document Frequency (IDF): logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the term T
  • TF-IDF: the product TF × IDF, which gives the relative importance of a term in a corpus (list of documents)
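
The definitions above can be worked through by hand. The sketch below implements the plain textbook form (raw count times the natural log of the document ratio); the helper name tf_idf is hypothetical and the corpus is lower-cased and stripped of punctuation for simplicity. scikit-learn's TfidfVectorizer applies a smoothed IDF and normalises each row by default, so its numbers will differ from these.

```python
import math

def tf_idf(term, doc_tokens, corpus_tokens):
    # TF: how often the term occurs in this document
    tf = doc_tokens.count(term)
    # IDF: log of (total documents / documents containing the term)
    containing = sum(1 for doc in corpus_tokens if term in doc)
    idf = math.log(len(corpus_tokens) / containing) if containing else 0.0
    return tf * idf

corpus = ["ram ate a mango", "mango is my favorite fruit", "sachin is my favorite"]
corpus_tokens = [doc.split() for doc in corpus]

mango_score = tf_idf("mango", corpus_tokens[0], corpus_tokens)  # term in 2 of 3 documents
ram_score = tf_idf("ram", corpus_tokens[0], corpus_tokens)      # term in 1 of 3 documents
print(mango_score, ram_score)
```

The rarer term gets the higher score, which is the sense in which TF-IDF measures relative importance.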

from sklearn.feature_extraction.text import TfidfVectorizer

obj = TfidfVectorizer()

corpus = ['Ram ate a mango.', 'mango is my favorite fruit.', 'Sachin is my favorite']

X = obj.fit_transform(corpus)

print(X)


Other tasks

Text Classification

  • Naïve Bayes Classifier
  • SVM

Text Matching

  • Levenshtein distance – minimum number of single-character edits needed to transform one string into the other
  • Phonetic matching – a phonetic matching algorithm takes a keyword as input (a person’s name, a location name etc.) and produces a character string that identifies a set of words that are (roughly) phonetically similar
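
The Levenshtein distance can be computed with a standard dynamic-programming table. This is a minimal sketch that keeps only two rows of the table at a time; the function name levenshtein is chosen for illustration.

```python
def levenshtein(a, b):
    # previous[j] holds the edit distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
```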

Different ways of reading a text file

f=open('genesis.txt')

words= f.read().split()

f.close()

f=open('genesis.txt')

words=[]

for line in f:
    # Add the words of each line to the list
    words.extend(line.split())

f.close()


Reading only the first line

f=open('genesis.txt')

# readline() returns a single line; the file must still be open when it is called
words=f.readline().split()

f.close()
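
A with-block is a common alternative to calling close() by hand. The sketch below writes a small throwaway file first so that it runs on its own; the file name sample.txt and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf8") as out:
    out.write("In the beginning\nGod created the heavens\n")

# The with-block closes the file automatically, even if an error occurs
with open(path, encoding="utf8") as f:
    words = [word for line in f for word in line.split()]

print(words)
print(f.closed)
```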

Ashish is an author and a data science professional with several years of experience in the field of Advanced Analytics. He has a B.Tech from IIT Madras and is a Young India Fellow, an exclusive 1-year academic program on leadership & liberal arts offered to 215 young bright Indians, who show exceptional intellectual & leadership ability.
