Kaggle has a number of competitions which are useful when learning data science. I attempted the Titanic competition using R.
Before trying any machine learning I began by analysing the data. In particular it is important to identify missing data, especially before attempting any random forest analysis, since R's randomForest will by default refuse rows with missing predictor values.
Kaggle provides the Titanic data as two files, train.csv and test.csv. The first thing I did was to combine these into one data frame, which required first adding a Survived column to the test data frame:
> train <- read.csv("train.csv")
> test <- read.csv("test.csv")
> test$Survived <- NA # this adds a column called Survived to test; every row gets the value NA
> all.data <- rbind(train,test)
all.data now contains both the train and test data. It is worth checking whether any columns contain NAs or blank strings; the Age column, for example, has a substantial number of missing values.
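A quick sketch of that check on a toy frame (standing in for the real all.data built above):

```r
# Toy stand-in for all.data; in the real script this is rbind(train, test)
all.data <- data.frame(
  Age   = c(22, NA, 35, NA, 58),
  Cabin = c("", "C85", "", "", "B42"),
  stringsAsFactors = FALSE
)

# Count NAs or blank strings in every column
colSums(is.na(all.data) | all.data == "", na.rm = TRUE)

# Percentage of Age values missing (about 20% on the real data)
round(100 * mean(is.na(all.data$Age)), 1)
```

On the real combined data the same check also flags the many blanks in Cabin and the couple in Embarked.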
So about 20% of Age values are missing. If you want to use a random forest algorithm to model survivability you will need to fix this. The simplest option is to replace every NA with the overall mean age. A more sophisticated approach is to use the title (Mr, Miss, Master, ...) buried in the Name column: extract it into a separate column, then take the mean age for each title and replace each missing age with the mean for that passenger's title. Conveniently, there are no missing titles.
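A sketch of the title approach, assuming the standard Kaggle Name format ("Braund, Mr. Owen Harris"); the toy frame stands in for all.data:

```r
# Toy stand-in for all.data with the usual "Surname, Title. Forenames" format
all.data <- data.frame(
  Name = c("Braund, Mr. Owen Harris",
           "Cumings, Mrs. John Bradley",
           "Heikkinen, Miss. Laina",
           "Palsson, Master. Gosta Leonard",
           "Allen, Mr. William Henry",
           "Vestrom, Miss. Hulda"),
  Age  = c(22, 38, 26, 4, NA, NA),
  stringsAsFactors = FALSE
)

# Extract the title: the text between ", " and the first full stop
all.data$Title <- sub(".*, ([^.]*)\\..*", "\\1", all.data$Name)

# Replace each missing Age with the mean Age for that title
all.data$Age <- ave(all.data$Age, all.data$Title,
                    FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
```

Here the two NAs are filled with the means of the other Mr and Miss ages respectively.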
You can explore the data a little further to see how each variable affected the chance of survival. For example, it is very clear that gender played an important role in survivability.
> (nrow(male.survived)/nrow(male))*100 # male.survived and male are subsets of the training data
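One line with prop.table gives both rates at once; a sketch on a toy frame standing in for the training data:

```r
# Toy stand-in for the training data
train <- data.frame(
  Sex      = c("male", "male", "male", "male", "male", "female", "female", "female"),
  Survived = c(0, 0, 0, 0, 1, 1, 1, 0)
)

# Rows are Sex, columns are Survived (0 = died, 1 = survived);
# margin = 1 normalises each row so it sums to 100%
round(prop.table(table(train$Sex, train$Survived), margin = 1) * 100, 1)
```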
So men had less than a 20% (roughly 1 in 5) chance of surviving, whereas about 3 in 4 women survived.
So the most basic model would be to assign Survived = 0 to every male in the test set and Survived = 1 to every female. This can be refined by looking at class. There were three ticket classes on the Titanic, first, second and third, with the following survival rates:
firstSurvived = 63%
secondSurvived = 47%
thirdSurvived = 24%
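Those per-class figures drop out of the same kind of proportion table; a toy sketch (the 63/47/24 numbers come from the real train.csv):

```r
# Toy stand-in for the training data
train <- data.frame(
  Pclass   = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  Survived = c(1, 1, 0, 1, 0, 1, 0, 0, 0)
)

# Survival rate within each ticket class
round(prop.table(table(train$Pclass, train$Survived), margin = 1) * 100, 1)
```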
Then combining gender and class:
> first.survived <- subset(train, Pclass == 1 & Survived == 1) # first class passengers who survived
> first.survived.male <- subset(first.survived, Sex == "male")
> first.survived.female <- subset(first.survived, Sex == "female")
To sum up (each percentage is the share of that class's survivors who were male or female, so each pair sums to 100%):

firstMaleSurvived = 33%
firstFemaleSurvived = 67% (first class passengers who were female: 43.5%)
secondMaleSurvived = 19.5%
secondFemaleSurvived = 80.5% (second class passengers who were female: 41.3%)
thirdMaleSurvived = 39.5%
thirdFemaleSurvived = 60.5% (third class passengers who were female: 29.3%)
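The gender breakdown of each class's survivors can be produced in one step by tabulating only the surviving rows; a toy sketch:

```r
# Toy stand-in for the training data
train <- data.frame(
  Pclass   = c(1, 1, 1, 2, 2, 2, 3, 3),
  Sex      = c("male", "female", "female", "male", "female", "female", "male", "female"),
  Survived = c(1, 1, 1, 0, 1, 1, 1, 1)
)

survivors <- subset(train, Survived == 1)

# For each class: what share of the survivors were male vs female?
round(prop.table(table(survivors$Pclass, survivors$Sex), margin = 1) * 100, 1)
```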
So within every class most survivors were women, and the imbalance was sharpest in second class, where only about one survivor in five was male; second class is also where men fared worst relative to women.
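The gender-only baseline described earlier takes only a couple of lines; a sketch where the toy frame stands in for test.csv and the output file name is arbitrary:

```r
# Toy stand-in for the Kaggle test set
test <- data.frame(
  PassengerId = 892:895,
  Sex         = c("male", "female", "female", "male")
)

# Predict 1 (survived) for every woman, 0 for every man
submission <- data.frame(PassengerId = test$PassengerId,
                         Survived    = ifelse(test$Sex == "female", 1, 0))

# Kaggle expects a two-column CSV of PassengerId and Survived
write.csv(submission, "gender_model.csv", row.names = FALSE)
```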
If you follow the tutorials (see the Kaggle website for links) you will end up with a model that is just over 80% accurate; the R tutorial uses a random forest. You can play around with the variables to improve on this initial attempt.
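A minimal random forest fit, assuming the randomForest package is installed and Age has already been imputed as above; the toy data and variable choice here are illustrative, not the tutorial's exact feature set:

```r
library(randomForest)

# Toy, fully-imputed stand-in for the cleaned training data
train <- data.frame(
  Survived = factor(c(0, 1, 1, 0, 1, 0, 0, 1, 1, 0)),
  Pclass   = c(3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
  Sex      = factor(c("male", "female", "female", "male", "female",
                      "male", "male", "female", "female", "male")),
  Age      = c(22, 38, 26, 35, 27, 54, 2, 14, 30, 40)
)

set.seed(42)  # forests are stochastic; fix the seed so runs are repeatable
fit <- randomForest(Survived ~ Pclass + Sex + Age, data = train, ntree = 100)

# Predicting on new data works the same way: predict(fit, newdata)
predictions <- predict(fit, train)
```

Because Survived is passed as a factor, randomForest fits a classification (not regression) forest.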