Discovering Data

Stemming and Lemmatizing

6/30/2017

When analysing natural language it is often necessary to normalise the text. For example, converting everything to lower case ensures that 'Inform' and 'inform' are not treated as different words. It might also be necessary to treat 'inform' and 'informed' as the same word, and in those cases stemming and/or lemmatizing is needed. Both processes can give unexpected results.
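As a small illustration of the lower-casing step, here is a minimal sketch using only the Python standard library. The function name count_tokens and the sample sentence are made up for this example, and splitting on whitespace is a simplifying assumption rather than a proper tokenizer.

```python
from collections import Counter


def count_tokens(text, lowercase=True):
    """Count whitespace-separated tokens, optionally lower-casing them first."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return Counter(tokens)


# Hypothetical sample sentence, chosen only to show the effect.
sample = "Inform the team and inform the client"

# Without lower-casing, 'Inform' and 'inform' are counted separately.
print(count_tokens(sample, lowercase=False))

# With lower-casing, both are counted under 'inform'.
print(count_tokens(sample, lowercase=True))
```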

In natural language processing a stemmer removes affixes from a word, leaving only the stem. NLTK comes with a number of built-in stemmers. Comparison of the standard stemmers:

word            Porter        Lancaster     Snowball
fasten          fasten        fast          fasten
classify        classifi      class         classifi
awaken          awaken        awak          awaken
differentiate   differenti    differenty    differenti
duplicate       duplic        duply         duplic
specialise      specialis     spec          specialis
lying           lie           lying         lie

For these words the Porter and Snowball stemmers generate the same results, whereas the Lancaster stemmer gives a different result for every word and usually cuts the word back further.
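
A comparison like this can be reproduced with a few lines of Python. The following is a minimal sketch, assuming NLTK is installed; the helper name compare_stemmers and the WORDS list are introduced here for illustration, and the exact stems may vary slightly between NLTK versions.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# The words used in the comparison.
WORDS = ["fasten", "classify", "awaken", "differentiate",
         "duplicate", "specialise", "lying"]


def compare_stemmers(words):
    """Return {word: (porter_stem, lancaster_stem, snowball_stem)}."""
    porter = PorterStemmer()
    lancaster = LancasterStemmer()
    snowball = SnowballStemmer("english")  # Snowball needs a language name
    return {
        w: (porter.stem(w), lancaster.stem(w), snowball.stem(w))
        for w in words
    }


if __name__ == "__main__":
    # Print one row per word, one column per stemmer.
    print(f"{'word':<16}{'Porter':<14}{'Lancaster':<14}{'Snowball':<14}")
    for word, stems in compare_stemmers(WORDS).items():
        print(f"{word:<16}" + "".join(f"{s:<14}" for s in stems))
```

The stemmers need no extra data downloads, so this should run straight after installing NLTK.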
Lemmatization is similar to stemming, except that it returns a valid dictionary word (the lemma) rather than a stem. For example:
  • classify gives classify rather than classifi
  • differentiate gives differentiate rather than differenti
  • believes gives belief, whereas the Porter stemmer gives believ
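
The lemmatizer examples can be tried with the sketch below. It assumes NLTK's WordNetLemmatizer is the lemmatizer in question (the post does not name one) and that the WordNet data has been fetched once with nltk.download("wordnet"). The helper name lemma_vs_stem is made up for this example. WordNetLemmatizer treats words as nouns unless a part of speech is passed, which is a likely reason for a result such as belief; passing pos="v" asks for the verb lemma instead.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer


def lemma_vs_stem(words, pos="n"):
    """Return {word: (lemma, porter_stem)}, or None if WordNet is missing.

    pos is the WordNet part-of-speech tag: "n" noun (the NLTK default),
    "v" verb, "a" adjective, "r" adverb.
    """
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    try:
        return {
            w: (lemmatizer.lemmatize(w, pos=pos), stemmer.stem(w))
            for w in words
        }
    except LookupError:
        # The WordNet corpus is not installed on this machine.
        return None


if __name__ == "__main__":
    words = ["classify", "differentiate", "believes"]
    for pos in ("n", "v"):
        result = lemma_vs_stem(words, pos=pos)
        if result is None:
            print('WordNet data not found; run nltk.download("wordnet") first')
            break
        for word, (lemma, stem) in result.items():
            print(f"pos={pos}  {word:<15} lemma={lemma:<15} stem={stem}")
```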