One of the weaknesses of codes which always map one letter to one symbol is that they can be cracked using statistical approaches. Each (human) language tends to use some letters more than others, giving each language a kind of statistical fingerprint. I wanted to investigate this using Python; my goal was to learn about the kinds of tools Python has available for this task.
The first piece of code uses urllib and BeautifulSoup to download and parse the webpage, then saves the text as a .txt file.
import urllib.request
from bs4 import BeautifulSoup

#url = 'https://en.wikipedia.org/wiki/Belfast'
#url = 'https://it.wikipedia.org/wiki/Belfast'
url = 'https://af.wikipedia.org/wiki/Belfast'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
tags = soup.find_all('p')  # every paragraph element on the page
textFile = open('belfast.txt', 'a', encoding='utf-8')
for p in tags:
    textFile.write(p.text + '\n')
textFile.close()
The next part uses collections.Counter() to get the distribution of letters, then numpy and matplotlib to generate the bar graphs:
import re
import collections
import numpy as np
import matplotlib.pyplot as plt

textFile = open('belfast.txt', 'r', encoding='utf-8')
prose = textFile.read().lower()
textFile.close()

letters = re.findall(r'[a-z]', prose)  # every individual letter a to z
labels, values = zip(*sorted(collections.Counter(letters).items()))

indexes = np.arange(len(labels))
width = 1
plt.bar(indexes, values, width)
plt.xticks(indexes + width * 0.5, labels)
plt.show()
The code produced the following graphs:
Distribution for English
Distribution for Afrikaans
Distribution for Italian
The code has one weakness: the regular expression filters out all characters except a to z, but many European languages use accented letters, especially on vowels, and these are excluded from the results. This could be fixed, but my goal was never a proper scientific study; it was to learn some Python. The graphs do show some interesting differences between the languages: for example, Afrikaans uses 'v' a lot, English loves 't', and Italian was the only language of the three to use 'x'.
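The accent limitation could be addressed by counting any alphabetic character rather than only a to z. A minimal sketch of that idea, using an illustrative Italian sentence rather than the scraped text:

```python
import collections

# Count every alphabetic character, so accented vowels such as
# 'è' and 'à' are kept rather than filtered out by an a-z regex.
prose = "perché la città è così bella".lower()
counts = collections.Counter(ch for ch in prose if ch.isalpha())

print(counts['è'])  # accented letters are now counted too
```

str.isalpha() is true for accented letters, so no regex is needed; the spaces are skipped because they are not alphabetic.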
Kaggle has a number of competitions which are useful when learning data science. I attempted the Titanic competition using R.
Before trying any machine learning I began by analysing the data. In particular it is important to identify missing data especially before attempting any random forest analysis.
Kaggle provides the Titanic data in two datasets, train.csv and test.csv. The first thing I did was to combine these into one data frame, which required adding a Survived column to the test data frame:
> train <- read.csv("train.csv")
> test <- read.csv("test.csv")
> test$Survived <- NA # this adds a column called Survived to test; each row has the value NA
> all.data <- rbind(train,test)
all.data now contains both the train and test data. It is worth checking whether columns contain nulls/NAs or blanks, for example the Age column:
> sum(is.na(all.data$Age)) # count the missing Age values
So about 20% of Age values are missing. If you want to use a random forest algorithm to model survivability you'll need to fix this. The simplest option is to replace all the missing values with the average age. A more sophisticated approach is to use the title (Mr, Miss, Master ...) which is buried in the Name column: extract it into a separate column, then take the mean age for each title. You can then replace any missing age with the appropriate mean based on the title - there are no missing titles.
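The post does this in R; the same title-based imputation can be sketched in Python with pandas (the tiny made-up data frame below just stands in for the Titanic data):

```python
import pandas as pd

# Tiny made-up stand-in for the Titanic data.
df = pd.DataFrame({
    'Name': ['Smith, Mr. John', 'Jones, Miss. Amy',
             'Brown, Mr. Bob', 'Lee, Miss. Cho'],
    'Age': [40.0, None, 30.0, 20.0],
})

# Titles like Mr, Miss, Master sit between the comma and the full stop.
df['Title'] = df['Name'].str.extract(r',\s*([^.]+)\.', expand=False)

# Replace each missing Age with the mean age for that passenger's title.
df['Age'] = df['Age'].fillna(df.groupby('Title')['Age'].transform('mean'))
```

groupby().transform('mean') broadcasts each title's mean age back to every row, so fillna only touches the rows where Age was NA.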
You can explore the data a little further to see how each variable changed the chance of surviving. For example it is very clear that gender had an important role in survivability.
> (nrow(male.survived)/nrow(male))*100 # where male.survived and male are data frames extracted from all.data
So men had less than a 20% (1 in 5) chance of surviving, whereas about 3 in 4 women survived.
So the most basic model would be to just assign a 0 to Survived for all males in the test data set and a 1 for all females. This can be refined by looking at class. There were 3 ticket classes on the Titanic: first, second and third:
firstSurvived = 63%
secondSurvived = 47%
thirdSurvived = 24%
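The per-class rates above come from the R data frames; the shape of the calculation is easy to see in pandas too (the miniature data frame here is made up purely for illustration, not the real figures):

```python
import pandas as pd

# Made-up miniature of the Titanic data, just to show the calculation.
df = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 0, 1, 0, 1, 0, 0, 0],
})

# Survival rate per ticket class, as a percentage.
rate = df.groupby('Pclass')['Survived'].mean() * 100
print(rate)
```

Because Survived is coded 0/1, its mean within each group is exactly the survival rate for that group.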
Then combining gender and class:
> first.survived.male <- subset(first.survived, Sex == "male")
> first.survived.female <- subset(first.survived, Sex == "female")
To sum up:

firstMaleSurvived = 33%
firstFemaleSurvived = 67% (43.5% of first class passengers were female)
secondMaleSurvived = 19.5%
secondFemaleSurvived = 80.5% (41.3% of second class passengers were female)
thirdMaleSurvived = 39.5%
thirdFemaleSurvived = 60.5% (29.3% of third class passengers were female)
So your best chance of survival was to be a female passenger travelling in second class. Men travelling second class had the worst survivability.
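The basic gender-only model described earlier is a one-liner in most languages. A pandas sketch, using the column names from the Kaggle files and a tiny stand-in for test.csv:

```python
import pandas as pd

# Tiny stand-in for test.csv; the real file has many more columns.
test = pd.DataFrame({
    'PassengerId': [892, 893, 894],
    'Sex': ['male', 'female', 'male'],
})

# Baseline model: every female survives (1), every male does not (0).
test['Survived'] = (test['Sex'] == 'female').astype(int)

# The two columns Kaggle expects in a submission.
submission = test[['PassengerId', 'Survived']]
```

Refining this by class would just mean building the 0/1 column from both Sex and Pclass instead of Sex alone.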
If you follow the tutorials (see the Kaggle website for links) you'll end up with a model that is just over 80% accurate; the R tutorial uses a random forest. You can play around with the variables to improve on this initial attempt.
This blog includes:
Scripts mainly in Python with a few in R covering NLP, Pandas, Matplotlib and others. See the home page for links to some of the scripts. Also includes some explanations of basic data science terminology.