Discovering Data

Building a language identifier part 2

12/11/2016

Part 1 used the frequency distribution of characters to identify languages. This time I'm using keywords for each of the three target languages: Dutch, English and Basque. The code is available on my code blog.
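To make the idea concrete, here is a minimal sketch of keyword scoring in Python. It is not the code from my code blog: the keyword lists below are small hypothetical samples of common function words, and the names KEYWORDS, keyword_scores and identify are made up for this illustration. Each word in the input adds one point to a language if it appears in that language's list.

```python
import re

# Hypothetical keyword lists (an assumption for this sketch): a few common
# function words per language, chosen so that no word is shared between lists.
KEYWORDS = {
    "dutch": {"de", "het", "een", "en", "van", "ik", "niet", "dat", "op", "zijn"},
    "english": {"the", "and", "to", "that", "it", "not", "with", "for", "this", "are"},
    "basque": {"eta", "da", "ez", "bat", "du", "ere", "hau", "zen", "baina", "dira"},
}


def keyword_scores(text):
    """Count, per language, how many words of the input are keywords."""
    # Lowercase the text and keep runs of letters only (no digits, no punctuation).
    words = re.findall(r"[^\W\d_]+", text.lower())
    return {
        lang: sum(1 for word in words if word in keywords)
        for lang, keywords in KEYWORDS.items()
    }


def identify(text):
    """Return the highest-scoring language, or None if no keyword matched."""
    scores = keyword_scores(text)
    best = max(scores, key=scores.get)  # on a tie, the first language listed wins
    return best if scores[best] > 0 else None


print(keyword_scores("Ik heb het boek niet gelezen"))
print(identify("hello world"))
```

With these sample lists the Dutch sentence scores 3 for Dutch and 0 for the other two languages, while "hello world" contains no keywords at all, so identify returns None.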

Test results

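For 20-word inputs:

test       dutch_score   english_score   basque_score
Basque     0             0               3
English    0             5               0
Dutch      6             0               0

For 9-word inputs:

test       dutch_score   english_score   basque_score
Basque     0             0               1
English    0             1               0
Dutch      2             0               0

For 2-word inputs:

test       dutch_score   english_score   basque_score
Basque     0             0               0
English    0             0               0
Dutch      0             0               0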

Summary:

Version 2 scored 100% for both the 20-word and 9-word inputs but failed miserably on the 2-word inputs, scoring 0% for each of the three languages: no keywords matched, so every score was 0. I think a combination of both approaches would produce the most accurate results, but very short inputs will always be difficult.
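One possible way to combine the two approaches, sketched in Python under my own assumptions (I have not built or tested this against the inputs above): trust the keyword scores when they produce a single clear winner, and otherwise fall back to the character-frequency scores from Part 1. The function name combined_guess is hypothetical, and both arguments are assumed to be dictionaries mapping a language name to a score where higher means a better match.

```python
def combined_guess(keyword_scores, char_scores):
    """Pick a language from keyword hits, falling back to character scores.

    Both arguments map language name -> score (higher is better).
    char_scores is assumed to be non-empty.
    """
    best = max(keyword_scores.values(), default=0)
    leaders = [lang for lang, score in keyword_scores.items() if score == best]

    # A single language with at least one keyword hit wins outright.
    if best > 0 and len(leaders) == 1:
        return leaders[0]

    # No keyword hits: consider every language the character scorer knows.
    # Keyword tie: let the character scores decide among the tied languages.
    candidates = leaders if best > 0 else list(char_scores)
    return max(candidates, key=lambda lang: char_scores.get(lang, 0.0))


# The character scores here are made-up numbers, purely for illustration.
no_hits = {"dutch": 0, "english": 0, "basque": 0}
print(combined_guess(no_hits, {"dutch": 0.2, "english": 0.7, "basque": 0.1}))
```

In this example no keyword matched, as in the 2-word tests, so the guess comes from the character scores alone and the call returns "english".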