I have some text data scraped from the web. To prepare the data for analysis I will do the following:
1. remove HTML tags using Beautiful Soup
2. remove digits using a regular expression; an alternative approach would be to convert the digits into words, for example 1 becomes 'one'
3. convert everything to lower case
4. remove stop words using NLTK
5. remove anything that is less than 3 characters in length
6. tokenize the text into a list of words
7. check that the remaining words are actual words
8. use a stemmer to reduce words down to stems; this process needs to be handled carefully as it can sometimes produce strange results
9. join the list of cleaned words back into a single text and write this out to a .txt file
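Step 2 mentions converting digits into words as an alternative to removing them. A minimal sketch of that idea, using an illustrative `DIGIT_WORDS` mapping and `replace_digits` function (these names are not part of the script below):

```python
import re

# map each single digit to its word form (illustrative, digits only)
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def replace_digits(text):
    # replace each digit with its word, padded with spaces,
    # then collapse the extra whitespace this introduces
    worded = re.sub(r"\d", lambda m: f" {DIGIT_WORDS[m.group()]} ", text)
    return re.sub(r"\s+", " ", worded).strip()

print(replace_digits("4 bedrooms"))  # four bedrooms
```

Note this treats each digit independently, so "18" becomes "one eight" rather than "eighteen"; a full number-to-words conversion would need more logic.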
#import necessary packages
import re
import nltk
from bs4 import BeautifulSoup as bs
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import LancasterStemmer
#to keep things simple I have hard coded the text data into the script
text = '''<h2>Additional Information</h2>Viewing Times:<br/><br/>4:45pm - 5:00pm Thursday 15th June<br/>4:45pm - 5:00pm Tues 20th June<br/>4:45pm - 5:00pm Fri 23rd June<br/><br/><br/>
A deceptively spacious and well presented end terrace has recently been refurbished throughout and briefly comprises an open plan kitchen / living / dining area, downstairs w.c.,
four generous bedrooms - master with ensuite and a family bathroom with modern white suite. Additional benefits include gas fired central heating and double glazing throughout.
The current tenant is currently on a month to month basis with a passing rent of £500 per calendar month. <br/><br/>UPVC front door with glazed panel.
<br/><br/>OPEN PLAN KITCHEN / RECEPTION AREA 18' 2" x 14' 6" (5.55m x 4.42m) Modern shaker style kitchen with a range of high and low level units with formica work surfaces.
Integrated oven with matching four ring hob. Stainless steel extractor hood. Stainless steel single drainer sink unit with mixer tap. Space for fridge freezer. Integrated dishwasher.
Recessed low voltage spot lights. <br/><br/>W.C. Modern white suite comprising low flush w.c. Pedestal wash hand basin with mixer tap. Laminate wooden floor.
<br/><br/>BUILT IN STORAGE CUPBOARD Plumbed for washing machine. <br/><br/>FIRST FLOOR LANDING Access to under stair storage. <br/><br/>MASTER BEDROOM 10' 8" x 11' 4" (3.26m x 3.46m)
<br/><br/>ENSUITE White suite comprising low flush w.c. Pedestal wash hand basin. Enclosed shower unit with thermostatic shower. Tiled floor. Extractor fan. Recessed low voltage spot
lights. <br/><br/>BEDROOM 4 8' 3" x 9' 5" (2.52m x 2.89m) <br/><br/>BATHROOM Modern white suite comprising low flush w.c. Panelled bath with mixer tap. Pedestal wash hand basin with
mixer tap. Enclosed shower unit with thermostatic shower. Part tiled wall. Recessed low voltage spot lights. Extractor fan. <br/><br/>SECOND FLOOR LANDING
<br/><br/>BEDROOM 2 14' 3" x 8' 11" (4.36m x 2.72m) (widest points) Built in storage cupboard. Sky light. <br/><br/>BEDROOM 3 14' 6" x 7' 3" (4.43m x 2.22m) sky light.'''
#initialise a couple of lists for use later
suspect_words = []
stem_word_list = []
#this function uses nltk wordnet to check that each word is an actual word
def check_word(word):
    if not wordnet.synsets(word):
        suspect_words.append(word)
#create a soup object then apply the get_text() function to remove tags
soup = bs(text, 'lxml')
text_bs = soup.get_text()
#remove digits and convert everything to lower case
text_bs_alpha = re.sub("[^a-zA-Z]", " ", text_bs.lower())
#use NLTK to tokenize the text; this creates a list of words
tokens = nltk.word_tokenize(text_bs_alpha)
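As an aside, because the earlier regex already replaced every non-letter with a space, a plain `str.split()` would give a comparable token list here without needing the NLTK punkt data (a simplification, not what this script uses):

```python
# non-letters were already replaced with spaces, so splitting on
# whitespace yields much the same word list as word_tokenize
sample = "open plan kitchen reception area"
tokens = sample.split()
print(tokens)  # ['open', 'plan', 'kitchen', 'reception', 'area']
```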
#remove stop words and anything of two characters or less; I added this condition because I found I was getting things like 'x' from the dimensions in the original text
stop_words = set(stopwords.words("english"))
word_list = [w for w in tokens if w not in stop_words and len(w) >= 3]
#call the function to check each word in the tokenized list
for word in word_list:
    check_word(word)
#use a stemmer to reduce everything down to stems. I did not like the results I got; for example the word 'pedestal' was reduced to 'pedest'. Other stemmers are available, and you can also write your own
stemmer = LancasterStemmer()
for word in word_list:
    word_stem = stemmer.stem(word)
    stem_word_list.append(word_stem)
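On the "write your own" option mentioned above, a toy suffix-stripping stemmer can be sketched in a few lines; the suffix list here is purely illustrative and far cruder than Lancaster or Porter:

```python
# a toy suffix stripper: remove the first matching suffix, but only
# if at least three characters of the word would remain
SUFFIXES = ["ing", "ed", "es", "s", "ly"]

def simple_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(simple_stem("comprising"))  # compris
print(simple_stem("pedestal"))   # pedestal (left alone, unlike Lancaster)
```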
#join everything back into a single text and write this to a file
text_clean = " ".join(stem_word_list)
with open('clean_text.txt', 'w') as op_file:
    op_file.write(text_clean)
#print out suspect words. This list included 'informationviewing', which was generated by running the get_text() function on 'Information</h2>Viewing': the </h2> was replaced with nothing
#I manually corrected the above problem
print(suspect_words)
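One way to avoid fused words like 'informationviewing' without a manual fix is to pass a separator to get_text(), which inserts that string in place of each tag:

```python
from bs4 import BeautifulSoup

# with separator=" ", each stripped tag becomes a space instead of nothing,
# so adjacent text nodes are no longer fused together
html = "<h2>Additional Information</h2>Viewing Times:"
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text(separator=" "))  # Additional Information Viewing Times:
```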
Notes: there is no one-size-fits-all script that will do everything you need; it depends on the data you are starting with and what you want to do with it.