Sometimes when analysing natural language it is necessary to normalise the text. For example converting everything to lower case ensures that 'Inform' and 'inform' are not treated as different words. Sometimes it might also be necessary to treat 'inform' and 'informed' as the same. In those cases stemming and/or lemmatizing will be necessary. But these processes can give unexpected results.
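As a minimal illustration (the example sentence is made up):

text = "Inform the tenant that the landlord will inform the agent"
#converting to lower case means 'Inform' and 'inform' are counted as the same word
tokens = text.lower().split()
print(tokens.count('inform'))   #prints 2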
In natural language processing a stemmer removes affixes from a word, leaving only the stem. NLTK comes with a number of built-in stemmers. A comparison of the standard stemmers:
The Porter and Snowball stemmers generate the same result but Porter is for English only. The Lancaster stemmer produces very different results.
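The differences are easy to see by running a few words through each stemmer (a minimal sketch; the word list is just for illustration):

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()

for word in ['informed', 'classify', 'maximum', 'pedestal']:
    #print the stem produced by each of the three stemmers
    print(word, porter.stem(word), snowball.stem(word), lancaster.stem(word))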
Lemmatization is similar to stemming except that it always returns a valid word rather than a stem. For example:
classify gives classify rather than classifi
differentiate gives differentiate rather than differenti
in the case of words like believes it gives belief whereas the Porter stemmer will give believ
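A minimal sketch showing this, using NLTK's WordNetLemmatizer alongside the Porter stemmer:

from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

for word in ['classify', 'differentiate', 'believes']:
    #print the original word, the lemma and the stem side by side
    print(word, lemmatizer.lemmatize(word), stemmer.stem(word))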
I have some text data scraped from the web. To prepare the data for analysis I will do the following:
1. remove html tags using Beautiful Soup
2. remove digits using a regular expression, an alternative approach would be to convert the digits into words, for example 1 becomes one.
3. convert everything to lower case
4. remove stop words using NLTK
5. remove anything that is less than 3 characters in length
6. tokenize the text into a list of words
7. check that the remaining words are actual words
8. use a stemmer to reduce words down to stems; this process needs to be handled carefully as it can sometimes produce strange results
9. join the list of cleaned words back into a single text and write this out to a .txt file
#import necessary packages
import re
import nltk
from bs4 import BeautifulSoup as bs
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import LancasterStemmer
#to keep things simple I have hard coded the text data into the script
text = '''<h2>Additional Information</h2>Viewing Times:<br/><br/>4:45pm - 5:00pm Thursday 15th June<br/>4:45pm - 5:00pm Tues 20th June<br/>4:45pm - 5:00pm Fri 23rd June<br/><br/><br/>
A deceptively spacious and well presented end terrace has recently been refurbished throughout and briefly comprises an open plan kitchen / living / dining area, downstairs w.c.,
four generous bedrooms - master with ensuite and a family bathroom with modern white suite. Additional benefits include gas fired central heating and double glazing throughout.
The current tenant is currently on a month to month basis with a passing rent of £500 per calendar month. <br/><br/>UPVC front door with glazed panel.
<br/><br/>OPEN PLAN KITCHEN / RECEPTION AREA 18' 2" x 14' 6" (5.55m x 4.42m) Modern shaker style kitchen with a range of high and low level units with formica work surfaces.
Integrated oven with matching four ring hob. Stainless steel extractor hood. Stainless steel single drainer sink unit with mixer tap. Space for fridge freezer. Integrated dishwasher.
Recessed low voltage spot lights. <br/><br/>W.C. Modern white suite comprising low flush w.c. Pedestal wash hand basin with mixer tap. Laminate wooden floor.
<br/><br/>BUILT IN STORAGE CUPBOARD Plumbed for washing machine. <br/><br/>FIRST FLOOR LANDING Access to under stair storage. <br/><br/>MASTER BEDROOM 10' 8" x 11' 4" (3.26m x 3.46m)
<br/><br/>ENSUITE White suite comprising low flush w.c. Pedestal wash hand basin. Enclosed shower unit with thermostatic shower. Tiled floor. Extractor fan. Recessed low voltage spot
lights. <br/><br/>BEDROOM 4 8' 3" x 9' 5" (2.52m x 2.89m) <br/><br/>BATHROOM Modern white suite comprising low flush w.c. Panelled bath with mixer tap. Pedestal wash hand basin with
mixer tap. Enclosed shower unit with thermostatic shower. Part tiled wall. Recessed low voltage spot lights. Extractor fan. <br/><br/>SECOND FLOOR LANDING
<br/><br/>BEDROOM 2 14' 3" x 8' 11" (4.36m x 2.72m) (widest points) Built in storage cupboard. Sky light. <br/><br/>BEDROOM 3 14' 6" x 7' 3" (4.43m x 2.22m) sky light.'''
#initialise a couple of lists for use later
suspect_words = []
stem_word_list = []
#this function uses nltk wordnet to check that each word is an actual word
#(check_word is just a placeholder name for the helper)
def check_word(word):
    if not wordnet.synsets(word):
        suspect_words.append(word)
#create a soup object then apply the get_text() function to remove tags
soup = bs(text, 'lxml')
text_bs = soup.get_text()
#remove anything that is not a letter (this takes care of the digits and punctuation) and convert everything to lower case
text_bs_alpha = re.sub("[^a-zA-Z]", " ", text_bs.lower())
#use NLTK to tokenize the text, this creates a list of words
tokens = nltk.word_tokenize(text_bs_alpha)
#remove stop words and anything of two characters or less, I added this condition because I found I was getting things like 'x' from the dimensions in the original text
word_list = [w for w in tokens if not w in stopwords.words("english") and len(w) >= 3]
#call the function to check each word in the tokenized list
for word in word_list:
    check_word(word)
#use a stemmer to reduce everything down to stems, I did not like the results I got, for example the word pedestal was reduced to pedest, other stemmers are available, you can also write your own
stemmer = LancasterStemmer()
for word in word_list:
    word_stem = stemmer.stem(word)
    stem_word_list.append(word_stem)
#join everything back into a single text and write this to a file
text_clean = " ".join(stem_word_list)
with open('clean_text.txt','w') as op_file:
    op_file.write(text_clean)
#print out suspect words, this list included 'informationviewing', this was generated by running the get_text() function on 'Information</h2>Viewing', the </h2> was replaced with nothing
#I manually corrected the above problem
print(suspect_words)
Notes: there is no one-size-fits-all script that will do everything you need; it depends on the data you are starting with and what you want to do with it.
We start with four sentences:
there is a dog in the garden
john enjoys watching movies
jane enjoys taking her dog for a walk
there are no dogs in this movie
Computers don't handle natural language well. One way to overcome this is to create a 'bag of words'
for each sentence. This can be thought of as a list of integer values. Each value is the number of occurrences
of a word in the sentence.
The first step is to tokenize the sentences then to create a set from this list of tokens. We can then
iterate through each word in the set and count the number of times that word occurs in each sentence.
The number is appended to a list. This process generates three lists or vectors.
import nltk

list_of_sentences = ['there is a dog in the garden','john enjoys watching movies','jane enjoys taking her dog for a walk','there are no dogs in this movie']
text = 'there is a dog in the garden john enjoys watching movies jane enjoys taking her dog for a walk there are no dogs in this movie'
tokens = nltk.word_tokenize(text)
set_of_words = set(tokens)
sent_1 = []
sent_2 = []
sent_3 = []
for i in range(1,4,1):
    sentence = list_of_sentences[i-1]
    sent_tokens = nltk.word_tokenize(sentence)
    for word in set_of_words:
        word_freq = sent_tokens.count(word)
        if i == 1:
            sent_1.append(word_freq)
        elif i == 2:
            sent_2.append(word_freq)
        else:
            sent_3.append(word_freq)
print(sent_1)
print(sent_2)
print(sent_3)
The output is:
[1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1]
[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1]
the word set is:
set(['a', 'dogs', 'no', 'garden', 'her', 'watching', 'this', 'movie', 'is',
'there', 'dog', 'for', 'walk', 'movies', 'are', 'in', 'jane', 'taking', 'the',
'john', 'enjoys'])
so 'a' occurs once in the first sentence, 'dogs' does not occur, 'no' does not occur
and so on.
The bag of words approach can be used in computer vision for image recognition/classification.
There are a number of packages that offer 'bag of words' functionality; the above script is not
intended to replace these packages, it is just to demonstrate what is meant by 'bag of words'.
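For example, a minimal sketch using scikit-learn's CountVectorizer (the token_pattern is set so that single-character words such as 'a' are kept, and get_feature_names_out needs a recent version of scikit-learn):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ['there is a dog in the garden',
             'john enjoys watching movies',
             'jane enjoys taking her dog for a walk',
             'there are no dogs in this movie']

#keep single character words so the result is comparable with the manual approach above
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
bow = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
print(bow.toarray())   #one count vector per sentence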
The following script takes a document and compares it to a set of documents to find how similar it is to each of them.
import gensim
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
raw_documents = ["An insecure file permissions vulnerability has been discovered in the official Icecream Ebook Reader v4.53 software. The vulnerability allows local attackers with system user accounts to elevate the access to higher system privileges.",
"A persistent cross site scripting web vulnerability has been discovered in the official Zenario v7.6 content management system.",
"The sensor communicates with the application processor via I2C bus #1, which also provides a firmware update interface.",
"Due to the lack of URI schemes validation any external URI scheme can be invoked by the Microsoft OneDrive iOS application with out any user interaction.",
"While performing network level testing of various Google applications, we discovered that the content for the application did not use SSL."]
gen_docs = [[w.lower() for w in word_tokenize(text)] for text in raw_documents]
dictionary = gensim.corpora.Dictionary(gen_docs)
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
tf_idf = gensim.models.TfidfModel(corpus)
sims = gensim.similarities.Similarity('similarDocs', tf_idf[corpus], num_features = len(dictionary))
query_doc = [w.lower() for w in word_tokenize("The default installation directory for Icecream Ebook Reader is Icecream Ebook Reader with weak folder permissions that grants EVERYONE modify privileges to the contents of the directory and it's subfolders. This allows an attacker opportunity for their own code execution under any other user running the application.")]
query_doc_bow = dictionary.doc2bow(query_doc)
query_doc_tf_idf = tf_idf[query_doc_bow]
print(sims[query_doc_tf_idf])
output of the script:
[ 0.48266575 0. 0.01086096 0.13409612 0.17690402]
So the comparison document most closely matches the first document. The least similar is the second document, with a score of zero.
#find the 30 most common words in a text file
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
with open('sdlp.txt','r') as ip_file:
text = ip_file.read()
tokenizer = RegexpTokenizer(r'\s+', gaps=True)
tokens = tokenizer.tokenize(text)
stop = set(stopwords.words('english'))
list_of_words = [i.lower() for i in tokens if i.lower() not in stop and i.isalpha()]
wordfreqdist = nltk.FreqDist(list_of_words)
mostcommon = wordfreqdist.most_common(30)
print(mostcommon)