Discovering Data

Natural Language Generation with Markovify

4/15/2021


What is Natural Language Generation (NLG)?

It is simply the generation of human language by a computer. As a very basic example, we will use some text from Bram Stoker's novel Dracula to generate some simple natural language. The novel is in the public domain and is available for download in several formats from Project Gutenberg.
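The original code is not shown in this copy of the post, so here is a minimal sketch of the idea, with a short excerpt from the novel's opening standing in for the full downloaded text:

```python
import random

# A short excerpt from the opening of Dracula; the post uses the
# full text downloaded from Project Gutenberg.
text = ("Left Munich at 8:35 on 1st May, arriving at Vienna early next "
        "morning; should have arrived at 6:46, but train was an hour late. "
        "Buda-Pesth seems a wonderful place.")

words = text.split()

# Pick ten words completely at random -- no ordering, no grammar.
print(" ".join(random.choices(words, k=10)))
```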

This produces the output:

over I is as Left that splendid to but was

Not surprisingly, this is just a random string of words drawn from the input. We can make a small change to improve on this, using ngrams.

An ngram is just a group of n consecutive words, where n is a positive integer. For example, with n = 2 and the sentence:

sentence = "Left Munich at 8:35"

the ngrams are: [('Left', 'Munich'), ('Munich', 'at'), ('at', '8:35')]
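The bigram version of the code is also missing here; a sketch that builds the ngrams and glues a random selection of them together might look like this (a longer line from the novel is used so there are more bigrams to choose from):

```python
import random

sentence = ("Left Munich at 8:35 on 1st May arriving at Vienna early "
            "next morning")
words = sentence.split()

# Build the bigrams (n = 2): every pair of consecutive words.
n = 2
ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Join a handful of randomly chosen bigrams into a "sentence".
print(" ".join(" ".join(gram) for gram in random.choices(ngrams, k=5)))
```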

This produces the output:

the train and a wonderful place next morning; should

Rather than a string of random words we now have a string of random ngrams. Each ngram looks reasonable on its own, but they are still combined in a random way.

The problem is that language is not a random collection of words: some words are more likely to follow others. For example, the word school is more likely to be followed by words like building, teacher or nurse than by words like mist, shelf or heat. This is called collocation, and we can use it in NLG. We need to introduce some probability.
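A quick way to see collocation in code (this example is mine, not from the original post) is to count which words follow each word in a tiny sample text; these follower counts are the raw material for the transition probabilities a Markov model uses:

```python
from collections import Counter, defaultdict

text = "the dog ran and the dog barked and the cat ran"
words = text.split()

# For each word, count the words that immediately follow it.
followers = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    followers[current][nxt] += 1

# "the" is followed by "dog" twice and "cat" once, so after "the"
# a model should pick "dog" with probability 2/3 and "cat" with 1/3.
print(followers["the"])  # Counter({'dog': 2, 'cat': 1})
```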

What is a Markov chain?

Let's say we have a simple system with three states: A, B and C. Perhaps it is a simple climate with just three types of weather: sunny (A), raining (B) and windy (C). These three states are mutually exclusive, so it cannot, for example, be both sunny and windy at the same time. We can represent this system graphically:
[Diagram: the three weather states, with arrows giving the probability of moving from each state to the others]

So if today's weather is sunny (A), we can say there is a probability of 0.2 that tomorrow will also be sunny, a probability of 0.2 that tomorrow will be raining and a probability of 0.6 that it will be windy (C). This is the key concept in Markov chains: the probability of the system being in a certain state depends only on the previous state. This idea can be combined with the idea of ngrams to help with NLG. The Python library Markovify uses these ideas to generate natural language.

This produces the output:

We sat breathing became convulsed.
He looked out into some time.
All this one old witches, who, like a small amount of the howling of us to be your relations at this our hope.

This is less than perfect, but we can improve on it by changing the state_size parameter, which controls the number of previous words that are considered when choosing the next word. Let's increase it to 3. Then the output becomes:

The walls were fluffy and heavy with dust, and the shutters were up.
Shortly afterwards, I heard the rustle of actual movement where I had placed my clothes.
When we came into Lucy's room I could see that Jonathan on one side of the window-sill and her eyes shut.

This reads like text from the original novel. It isn't perfect, but in just a few lines of code we are starting to generate some reasonably good natural language.

What are some applications for NLG?

In business it can be used to generate:
  • email subject lines
  • Facebook ads
  • display ads and SMS messages

It can also generate weather forecasts and short articles, but there are limitations, especially when generating longer pieces: over a few sentences the text can seem natural and logical, but over several paragraphs it can become incoherent.

