Sometimes when analysing natural language it is necessary to normalise the text. For example converting everything to lower case ensures that 'Inform' and 'inform' are not treated as different words. Sometimes it might also be necessary to treat 'inform' and 'informed' as the same. In those cases stemming and/or lemmatizing will be necessary. But these processes can give unexpected results.
In natural language programming a stemmer will remove affixes from a word leaving only the stem. NLTK comes with a number of in-built stemmers. Comparison of the standard stemmers:
The Porter and Snowball stemmers generate the same result the Lancaster stemmer produces very different results.
Lemmatization is simmilar to stemming except it always generates a valid word rather than a stem. For example:
classify gives classify rather than classifi
differentiate gives differentiate rather than differenti
in the case of words like believes it gives belief whereas the Porter stemmer will give believ
We start with four sentences:
there is a dog in the garden
john enjoys watching movies
jane enjoys taking her dog for a walk
there are no dogs in this movie
Computers don't handle natural language well. One way to overcome this is to create a 'bag of words' for each sentence. This can be thought of as a list of integer values. Each value is the number of occurences of a word in the sentence.
The first step is to tokenize the sentences then to create a set from this list of tokens. We can then itterate through each word in the set and count the number of times that word occurs in each sentence. The number is appended to a list. This process generates three lists or vectors.
Bag of words
[1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1]
[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1]
the word set is:
set(['a', 'dogs', 'no', 'garden', 'her', 'watching', 'this', 'movie', 'is',
'there', 'dog', 'for', 'walk', 'movies', 'are', 'in', 'jane', 'taking', 'the',
so 'a' occurs once in the first sentence, 'dogs' does not occur, 'no' does not occur
and so on.
The bag of words approach can be used in computer vision for image recognition/classification.
There are a number of packages that offer 'bag of words' functionality, the above script is not
intended to replace these packages, it is just to demonstrate what is meant by 'bag of words'
The following script will take a document and compare it to a set of documents to find the document similarities.
output of the script:
[ 0.48266575 0. 0.01086096 0.13409612 0.17690402]
So the comparison document most closely matches the first document. The least similar is the second document with a score of two.
This blog includes:
Scripts mainly in Python with a few in R covering NLP, Pandas, Matplotlib and others. See the home page for links to some of the scripts. Also includes some explanations of basic data science terminology.