Discovering Data

Bag of words - example

6/24/2017

We start with four sentences:

there is a dog in the garden
john enjoys watching movies
jane enjoys taking her dog for a walk
there are no dogs in this movie


Computers work with numbers and don't handle raw natural language well. One way to overcome this is to create a 'bag of words' for each sentence. This can be thought of as a list of integer values, where each value is the number of occurrences of one word in the sentence.

The first step is to tokenize the sentences and then create a set from the resulting list of tokens. We can then iterate through each word in the set and count the number of times that word occurs in each sentence. Each count is appended to that sentence's list. This process generates one list, or vector, per sentence.
Bag of words
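The steps above can be sketched as follows. This is a minimal illustration rather than the original script: it assumes lowercase text and simple whitespace tokenization, and the names `tokenized`, `vocabulary` and `vectors` are chosen for this example. It also sorts the vocabulary so the column order is reproducible, which means the columns will not line up with the word set order listed in this post.

```python
# Minimal bag-of-words sketch (assumes lowercase text, whitespace tokenization).
sentences = [
    "there is a dog in the garden",
    "john enjoys watching movies",
    "jane enjoys taking her dog for a walk",
    "there are no dogs in this movie",
]

# Tokenize: split each sentence on whitespace.
tokenized = [sentence.split() for sentence in sentences]

# Build the vocabulary: the set of distinct tokens across all sentences.
# Sorting is an addition for reproducibility; a raw set has arbitrary order.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# For each sentence, count how often each vocabulary word occurs in it.
vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each printed vector has one entry per vocabulary word, so all vectors have the same length regardless of how long the sentence is.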

    
The output (vectors for the first three sentences):
[1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1]
[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1]

The word set is:


set(['a', 'dogs', 'no', 'garden', 'her', 'watching', 'this', 'movie', 'is', 
'there', 'dog', 'for', 'walk', 'movies', 'are', 'in', 'jane', 'taking', 'the', 
'john', 'enjoys'])

Each position in a vector corresponds to the word at the same position in the word set.
So 'a' occurs once in the first sentence, 'dogs' does not occur, 'no' does not occur,
and so on.
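To make that mapping concrete, here is a small sketch that pairs the word set with the first output vector and lists the words that occur. The word order and the vector are copied from the listings in this post; the variable names are chosen for this example.

```python
# Word order copied from the word set printed in this post.
words = ['a', 'dogs', 'no', 'garden', 'her', 'watching', 'this', 'movie', 'is',
         'there', 'dog', 'for', 'walk', 'movies', 'are', 'in', 'jane', 'taking',
         'the', 'john', 'enjoys']

# First output vector, for "there is a dog in the garden".
first_vector = [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0]

# Pair each word with its count and keep only the words that occur.
present = [word for word, count in zip(words, first_vector) if count > 0]
print(present)
# ['a', 'garden', 'is', 'there', 'dog', 'in', 'the']
```

Note that the vector records which words occur and how often, but not the order in which they appeared in the sentence.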
The bag of words approach can also be used in computer vision for image recognition and classification.
A number of packages offer 'bag of words' functionality. The above script is not
intended to replace these packages; it is just to demonstrate what is meant by 'bag of words'.
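The post does not name a specific package, so the following is only one possible illustration. It assumes Python's standard library `collections.Counter`, which behaves like a multiset (a 'bag') and handles the counting step directly. Dedicated NLP libraries offer fuller vectorizers on top of the same idea.

```python
from collections import Counter

sentence = "jane enjoys taking her dog for a walk"

# Counter maps each token to the number of times it occurs.
bag = Counter(sentence.split())

# Looking up a word that never occurred returns 0 instead of raising an error.
print(bag["dog"])
print(bag["dogs"])

# Turn the bag into a fixed-order vector for a chosen (hypothetical) word list.
word_list = ["a", "dog", "dogs", "walk"]
vector = [bag[word] for word in word_list]
print(vector)
```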