The dataset is available on the Kaggle website; you'll need a Kaggle account to download it.
The following takes an initial look at the Times Higher Education World University Ranking data and the Center for World University Rankings (CWUR) data. The question I'm interested in is: is the Times data biased towards universities that use English for instruction? It is difficult to know exactly what language a university uses. In some cases a university may use different languages for different courses and different levels, and it is not always possible to guess the language of instruction from the country. For example, a number of Dutch and some German universities use English, especially at postgraduate level. However, countries which use English as an official language tend to use English as the only teaching medium. So if the Times data is biased towards universities that use English, we might expect universities in English-speaking countries to be ranked significantly higher in the Times data than in the CWUR data. I used R and ggplot2 to get some initial impressions:
> library(ggplot2)
> # Times Higher Education data, split by ranking year
> dt <- read.csv('timesData.csv', header = TRUE)
> dt2011 <- dt[which(dt$year == '2011'),]
> dt2012 <- dt[which(dt$year == '2012'),]
> dt2013 <- dt[which(dt$year == '2013'),]
> dt2014 <- dt[which(dt$year == '2014'),]
> dt2015 <- dt[which(dt$year == '2015'),]
> dt2016 <- dt[which(dt$year == '2016'),]
> # CWUR data, split by ranking year
> cdt <- read.csv('cwurData.csv', header = TRUE)
> cdt2012 <- cdt[which(cdt$year == '2012'),]
> cdt2015 <- cdt[which(cdt$year == '2015'),]
> # Bar charts of the number of ranked universities per country: Times 2016, then CWUR 2015
> ggplot(data=dt2016, aes(x=country)) + geom_bar() + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
> ggplot(data=cdt2015, aes(x=country)) + geom_bar() + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
This code produces two bar charts, one for the 2016 Times data and one for the 2015 CWUR data. Each bar shows how many universities a country has in that ranking.
The overall shape of the two graphs is very similar. In both graphs the US clearly has the most ranked universities. The obvious difference is the UK: in the Times data it is clearly in second place, while in the CWUR data it is in fourth place, behind China and Japan and only just ahead of other Western European countries.
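Reading places off a bar chart is a bit error-prone, so the same comparison can be done as a table. The following is only a rough sketch: `times_toy` and `cwur_toy` are made-up stand-ins for `dt2016` and `cdt2015` (the rows are invented, not taken from either dataset), and `country_counts` is a helper name I've chosen for illustration.

```r
# Toy stand-ins for dt2016 and cdt2015. These rows are invented for
# illustration only and are NOT real ranking data.
times_toy <- data.frame(country = c("USA", "USA", "USA", "UK", "UK", "Japan"))
cwur_toy  <- data.frame(country = c("USA", "USA", "USA", "Japan", "Japan", "UK"))

# Count universities per country and number the countries from most to
# fewest universities (ties are broken alphabetically).
country_counts <- function(df) {
  counts <- as.data.frame(table(country = df$country),
                          responseName = "n",
                          stringsAsFactors = FALSE)
  counts <- counts[order(-counts$n, counts$country), ]
  counts$position <- seq_len(nrow(counts))
  rownames(counts) <- NULL
  counts
}

# Put the two rankings side by side, one row per country.
comparison <- merge(country_counts(times_toy), country_counts(cwur_toy),
                    by = "country", all = TRUE,
                    suffixes = c("_times", "_cwur"))
comparison <- comparison[order(comparison$position_times), ]
print(comparison)
```

To try this on the real data, pass `dt2016` and `cdt2015` to `country_counts` instead of the toy frames. The merge matches on the exact country string, so it is worth checking first that both files spell country names the same way.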
My initial conclusion is that the Times data doesn't discriminate against universities in non-English-speaking countries such as China and Japan, but instead tends to be biased specifically towards UK universities. The Times data originates in the UK.
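One way to probe this conclusion a little further is to compare the share of ranked universities that come from the UK with the share that come from other English-speaking countries. This is a minimal sketch under stated assumptions: the list in `other_english` is my own illustrative and incomplete guess at English-speaking countries, `share_of` is a hypothetical helper, and the two toy data frames are invented rather than real ranking data.

```r
# Assumption: an illustrative, incomplete list of English-speaking
# countries other than the UK. Adjust to match the names in the real files.
other_english <- c("USA", "Canada", "Australia")

# Fraction of rows in df whose country is in the given set.
share_of <- function(df, countries) {
  mean(df$country %in% countries)
}

# Toy stand-ins for the Times and CWUR tables (invented rows, not real data).
times_share_toy <- data.frame(country = c("USA", "USA", "UK", "UK", "Japan", "China"))
cwur_share_toy  <- data.frame(country = c("USA", "USA", "UK", "Japan", "China", "China"))

shares <- data.frame(
  ranking       = c("Times", "CWUR"),
  uk            = c(share_of(times_share_toy, "UK"),
                    share_of(cwur_share_toy, "UK")),
  other_english = c(share_of(times_share_toy, other_english),
                    share_of(cwur_share_toy, other_english))
)
print(shares)
```

If the UK share is noticeably larger in the Times data while the share for the other English-speaking countries is about the same in both, that would be consistent with a UK-specific bias rather than a general English-language one. Shares like these only describe how many universities appear in each list, not how highly they are placed.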