I have been teaching myself various data analytics and data science tools, and I wanted a project to try some of them out. Language identification is, for me, an interesting and non-trivial project. By language I mean human language such as English, French, Russian etc. Some languages can be identified in their written form because they use unique scripts, for example Hangul for Korean, Hiragana and Katakana for Japanese, Hanzi for Chinese and so on. I wanted to build a tool that can distinguish between the written forms of languages that all use the Roman alphabet.
The requirement for the tool is that it can take a sentence or two of input text, parse it and decide which language it is written in. The goal is not to create a commercially viable product but rather to have fun building something that lets me employ some of the tools I have been learning. I will use Python and some of the libraries available to me, such as NLTK.
There are a number of different approaches I could take; the first one I want to investigate uses the frequency distribution of characters to identify the language. The code used to determine the frequency distribution of letters in some languages is in my code blog. The initial target languages are English, Dutch and Basque.
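As a rough illustration of the counting involved, here is a minimal sketch; the file name, the restriction to the letters a-z and the plotting choices are my own assumptions, and the actual code in the code blog may differ:

import string
from collections import Counter

import matplotlib.pyplot as plt


def letter_frequencies(path):
    """Return the relative frequency of each letter a-z in a text file."""
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    counts = Counter(ch for ch in text if ch in string.ascii_lowercase)
    total = sum(counts.values()) or 1
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}


# Plot the distribution for one sample text (hypothetical file name).
freqs = letter_frequencies("sample_basque.txt")
plt.bar(list(freqs.keys()), list(freqs.values()))
plt.title("Letter frequency distribution")
plt.show()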
Once I generated the frequency distribution graphs for the three languages (shown below) I looked for some identifying features:
Frequency distribution of letters in a sample text written in Basque.
Frequency distribution of letters in a sample text written in English.
Frequency distribution of letters in a sample text written in Dutch.
In English the vowels 'e' and 'o' are more common than 'a', and 't' and 'y' are also used often. 'j', 'q' and 'x' are not common, but this does not help because they are not common in Basque or Dutch either.
In Dutch 'e' and 'a' are more common than 'o', and 'y' is not common. 'v', 'w' and 'z' are used with almost equal frequency, whereas in English 'w' is much more common than either 'v' or 'z', and in Basque 'z' is very common while 'w' and 'v' are not.
Basque uses 'a' more than 'e'. 'k' is more common in Basque than in either Dutch or English.
So the initial algorithm will use the usage frequency of:
'a' and 'e'
'k', 'l' and 'm'
'v', 'w' and 'z'
to decide the language being used; a rough sketch of this scoring follows.
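The thresholds and the scoring scheme in the sketch below are my own illustration and may not match the actual implementation:

import string
from collections import Counter


def letter_freq(text):
    # Relative frequency of each letter a-z in the input text.
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values()) or 1
    return {c: counts[c] / total for c in string.ascii_lowercase}


def identify_language(text):
    f = letter_freq(text)
    scores = {"dutch": 0, "english": 0, "basque": 0}

    # 'a' vs 'e': Basque favours 'a', English and Dutch favour 'e'.
    if f["a"] > f["e"]:
        scores["basque"] += 1
    else:
        scores["english"] += 1
        scores["dutch"] += 1

    # 'k' compared with 'l' and 'm': 'k' is unusually common in Basque.
    if f["k"] > f["l"] and f["k"] > f["m"]:
        scores["basque"] += 1

    # 'v', 'w', 'z': 'z' dominates in Basque, 'w' dominates in English,
    # and the three are roughly equal in Dutch.
    if f["z"] > f["w"] and f["z"] > f["v"]:
        scores["basque"] += 1
    elif f["w"] > f["v"] and f["w"] > f["z"]:
        scores["english"] += 1
    else:
        scores["dutch"] += 1

    return max(scores, key=scores.get), scores


print(identify_language("Dit is een korte Nederlandse zin over het weer."))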
The code for the language identifier is in my code blog. The test results:
I tested the tool using input text from online news sources in the target languages. Each test used one or two short sentences as input, and I chose input containing only the target language; trying to identify the main language from input that mixes in other languages is a more complex task.
For 20-word inputs:

test      dutch_score  english_score  basque_score
Basque    1            0              4
English   2            4              0
Dutch     4            2              1

For 9-word inputs:

test      dutch_score  english_score  basque_score
Basque    1            0              4
English   3            3              0
Dutch     4            2              1

For 2-word inputs:

test      dutch_score  english_score  basque_score
Basque    3            1              2
English   2            2              1
Dutch     4            2              1
The results are not surprising. Given that the identification is based on a statistical analysis, the tool works best for longer test inputs: inputs of at least 20 words were identified with 100% accuracy, the shorter 9-word inputs less well (67%), and the two-word inputs quite poorly, at just 33%.
The algorithm works best for longer inputs. It is better at recognising Basque and poor at distinguishing English from Dutch. The approach I took can handle three languages, but if the number of languages grew beyond ten the algorithm would become much more complicated. Next time I will use NLTK to identify stop words, and then use those stop words to identify the language.
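As a preview, here is a minimal sketch of the stop-word idea, assuming NLTK's stopwords corpus provides word lists for English, Dutch and Basque; the scoring scheme is my own illustration, not the approach I will necessarily end up with:

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)


def guess_language(text, languages=("english", "dutch", "basque")):
    # Score each candidate language by how many of its stop words
    # appear in the input text, and pick the highest-scoring one.
    words = set(w.lower() for w in nltk.wordpunct_tokenize(text))
    scores = {lang: len(words & set(stopwords.words(lang))) for lang in languages}
    return max(scores, key=scores.get), scores


print(guess_language("The weather was fine and we went for a walk."))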