One of the weaknesses of codes that always map one letter to one symbol is that they can be cracked using statistical approaches. Each (human) language tends to use some letters more than others, giving each language a kind of statistical fingerprint. I wanted to investigate this using Python. My goal was to learn about the kind of tools Python has available for this task.
The first script uses urllib and BeautifulSoup to download the Wikipedia page for Belfast, extract its paragraph text and save it as a .txt file:
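As a quick aside, here is a minimal sketch of how that kind of statistical attack can work on the simplest one-letter-to-one-symbol code, a Caesar shift. It assumes the hidden message is English and that 'e' is its most common letter, which will not hold for every short text. The function names caesar_shift and guess_shift are made up for this illustration.

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase


def caesar_shift(text, shift):
    """Shift each a-z letter by `shift` places; leave other characters alone."""
    shift %= 26
    table = str.maketrans(ALPHABET, ALPHABET[shift:] + ALPHABET[:shift])
    return text.lower().translate(table)


def guess_shift(ciphertext, likely_letter="e"):
    """Guess the shift, assuming the most common cipher letter stands for `likely_letter`."""
    letters = [c for c in ciphertext.lower() if c in ALPHABET]
    most_common = Counter(letters).most_common(1)[0][0]
    return (ord(most_common) - ord(likely_letter)) % 26


plain = "meet me here between the green trees"
secret = caesar_shift(plain, 3)          # encode with a shift of 3
shift = guess_shift(secret)              # recover the shift from letter counts alone
print(caesar_shift(secret, -shift))      # decode by shifting back
```

The guess only needs the letter counts, not the key, which is exactly why a fixed letter-to-symbol mapping is weak.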
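import urllib.request
from bs4 import BeautifulSoup

# Pick one URL per run by commenting out the others
#url = 'https://en.wikipedia.org/wiki/Belfast'
#url = 'https://it.wikipedia.org/wiki/Belfast'
url = 'https://af.wikipedia.org/wiki/Belfast'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
tags = soup.find_all('p')
# Mode 'a' appends, so text from earlier runs stays in the file
with open('belfast.txt', 'a', encoding='utf-8') as textFile:
    for p in tags:
        # p.next_sibling can be None or another tag, so add a plain newline instead
        text = p.text + '\n'
        textFile.write(text)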
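To show what "take the text of every p tag" amounts to, here is a rough sketch that does the same job using only the standard library's html.parser instead of BeautifulSoup. This is a swapped-in technique for illustration, not what the script above uses, and the names ParagraphText and paragraph_text are invented here. It works on an HTML string, so it can be tried without downloading anything.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._parts = []       # text pieces of the current paragraph
        self._inside = False   # True while inside a <p>

    def _flush(self):
        # Store the current paragraph (if it has any text) and start a new one
        text = "".join(self._parts).strip()
        if text:
            self.paragraphs.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # an unclosed <p> ends where the next one begins
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._parts.append(data)


def paragraph_text(html):
    """Return a list with the text of each <p> element in `html`."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()             # hands over any text still buffered
    parser._flush()
    return parser.paragraphs


page = "<h1>Belfast</h1><p>Belfast is a <b>city</b>.</p><div>menu</div><p>Second.</p>"
print(paragraph_text(page))
```

BeautifulSoup is still the more comfortable choice for real pages; the sketch is only meant to make the extraction step less of a black box.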
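The next part uses collections.Counter() to count the letters, then numpy and matplotlib to generate the bar graphs. Note that the capture group in the regular expression keeps only the first letter of each matched word, so the counts are for word-initial letters: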
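import collections
import re
import numpy as np
import matplotlib.pyplot as plt

with open('belfast.txt', 'r', encoding='utf-8') as textFile:
    prose = textFile.read().lower()
# The capture group keeps only the leading a-z letter of each match,
# which is normally the first letter of a word
words = re.findall(r'([a-z])\w+', prose)
# Count the letters and sort them alphabetically
labels, values = zip(*sorted(collections.Counter(words).items()))
indexes = np.arange(len(labels))
width = 1
plt.bar(indexes, values, width)
# Bars are centred on their x positions by default in current matplotlib
plt.xticks(indexes, labels)
plt.show()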
The code produced the following graphs:
Distribution for English
Distribution for Afrikaans
Distribution for Italian
The code has one weakness: the regular expression filters out all characters except a to z. Many European languages use accents, especially on vowels, and these letters would be excluded from the results. This could be fixed, but it was never my goal to do a proper scientific study; my goal was to learn some Python. The graphs do show some interesting differences between the languages: for example, Afrikaans uses 'v' a lot while English loves 't', and Italian was the only language to use 'x'.
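One possible way to keep the accented letters is sketched below. It is not the code that produced the graphs above. It counts every letter in the text rather than using a regular expression, and relies on str.isalpha(), which is true for letters such as 'è' or 'ë'. The function name letter_frequencies is made up for this example, and the counts are turned into percentages so that texts of different lengths can be compared.

```python
from collections import Counter
import unicodedata


def letter_frequencies(text):
    """Return each letter's share of all letters in `text`, as a percentage."""
    # NFC joins a base letter and a separate combining accent into one character,
    # so 'e' + combining grave is counted as the single letter 'è'
    text = unicodedata.normalize("NFC", text).lower()
    # isalpha() keeps accented letters and drops digits, spaces and punctuation
    counts = Counter(ch for ch in text if ch.isalpha())
    total = sum(counts.values())
    return {letter: 100 * n / total for letter, n in sorted(counts.items())}


sample = "La città è più bella"
for letter, share in letter_frequencies(sample).items():
    print(f"{letter}: {share:.1f}%")
```

Note that isalpha() also accepts letters from non-Latin alphabets, so a page that quotes other scripts would add extra bars to the graph.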