According to a BBC article, the campaigns to leave or remain in the EU have had no effect on people's opinions. I wanted to test this statement. To their credit, the BBC published the list of poll data they used to reach that conclusion, which means I can test their claim.
The first thing to do is scrape the data from the BBC website. I used Python and Beautiful Soup to do this:
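import csv
import urllib2

from bs4 import BeautifulSoup

# Header row (assumed: the initial value was lost from the original listing).
# The names match the columns the R code reads later on.
csvOutput = [["date", "remain", "leave", "dontknow", "company", "method"]]
csvRows = []

url = "http://www.bbc.co.uk/news/special/2016/newsspec_13636/content/english/index.html"
htmlPage = urllib2.urlopen(url).read()
soup = BeautifulSoup(htmlPage, "lxml")

# Each <tr> in the table body is one poll.
tbl = soup.find("tbody").find_all("tr")
for row in tbl:
    cells = row.find_all("td")
    # The cell indices were also lost; 0 to 5 in this order is an assumption.
    date = cells[0].get_text()
    remain = cells[1].get_text()
    leave = cells[2].get_text()
    dontknow = cells[3].get_text()
    company = cells[4].get_text()
    method = cells[5].get_text()
    csvRows = [date, remain, leave, dontknow, company, method]
    csvOutput.append(csvRows)

# Python 2: open the file in binary mode ('wb'), see the note below.
ofile = open('euRefData.csv', 'wb')
writer = csv.writer(ofile, delimiter=',')
for element in csvOutput:
    writer.writerow(element)
ofile.close()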
One thing to note: I usually use Python 3.x, but this time I'm using 2.x. Two things in the script are specific to 2.x: urllib2 (its Python 3 counterpart is urllib.request) and the 'wb' mode passed to open. In 2.x, without the 'b' you'll probably get blank rows or other unexpected results when writing to the output file. I wrote the data to a CSV file because I want to use ggplot to look at the data.
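For anyone following along on Python 3, the file-writing step changes slightly. Below is a minimal sketch, assuming the rows have already been collected into a list of lists; write_rows is a hypothetical helper name, not something from the script above.

```python
import csv


def write_rows(path, rows):
    """Write a list of rows to a CSV file (Python 3 version of the 'wb' step)."""
    # Python 3 opens the file in text mode. Passing newline='' lets the csv
    # module control line endings, which avoids blank rows on Windows.
    with open(path, "w", newline="", encoding="utf-8") as ofile:
        writer = csv.writer(ofile, delimiter=",")
        writer.writerows(rows)
```

Calling write_rows('euRefData.csv', csvOutput) would then replace the open/writer/loop lines at the end of the script.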
With the data saved into a .csv file it is easy to import into R:
> library(ggplot2)
> dt <- read.csv('euRefData.csv', header = TRUE)
> dt$date <- factor(dt$date, levels=unique(dt$date))
> ggplot(data=dt, aes(x=date, y=remain)) + geom_point() + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
> online <- dt[which(dt$method=='online'),]
> ggplot(data=online, aes(x=date, y=remain)) + geom_point() + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
> phone <- dt[which(dt$method=='phone'),]
> ggplot(data=phone, aes(x=date, y=remain)) + geom_point() + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
One thing to note is that I converted the date field into a factor. This is a quick way to overcome ggplot's tendency to reorder the dates on the x-axis into alphabetical order. Without the conversion, the dates are plotted in alphabetical rather than chronological order.
There is, however, one problem with this quick fix: if you want to add a best-fit line you won't be able to, because it doesn't work with factors.
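The root of both problems is that the dates are plain text, so they sort alphabetically and cannot serve as a numeric axis. One way around this is to parse them into real dates. Here is a rough sketch in Python. The label format ('12 June') and the year 2016 are assumptions about what the scraped column contains, and both function names are hypothetical.

```python
from datetime import datetime


def parse_poll_date(label, year=2016, fmt="%d %B %Y"):
    """Turn a label such as '12 June' into a datetime.date.

    The label format and the year are assumptions; change fmt to match
    whatever the scraped date column really holds.
    """
    return datetime.strptime("{} {}".format(label.strip(), year), fmt).date()


def chronological(labels):
    """Return the labels sorted by calendar date instead of alphabetically."""
    return sorted(labels, key=parse_poll_date)
```

Once the labels are real dates, parse_poll_date(label).toordinal() gives a plain number for each poll, which is the kind of x value a fitted line needs.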
One striking feature of the data is the random spread. For example, plotting the remain data:
It is hard to see any significant pattern in this plot. The data consists of both online and phone data. If we filter out the online data and keep only the phone poll data:
Now there is some indication that the phone polls show a drop in support for remaining in the EU.
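To put a number on an impression like this, one option is to fit a straight line through the phone poll values and look at the sign of the slope. The sketch below is a plain least-squares fit, not the method used for the plots above. It assumes rows read with csv.DictReader from euRefData.csv, with 'remain' and 'method' columns, and both function names are hypothetical.

```python
def remain_series(rows, method):
    """Pick the remain percentages for one polling method, in file order."""
    # rstrip('%') is a precaution in case the scraped values carry a % sign.
    return [float(r["remain"].rstrip("%")) for r in rows if r["method"] == method]


def fit_slope(xs, ys):
    """Ordinary least-squares slope and intercept of ys against xs."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more points, with xs and ys the same length")
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

The x values could be the poll's position in the file or a date ordinal. A negative slope would be consistent with a drop in remain support, but with data this scattered the slope alone says nothing about whether the drop is real.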
In conclusion, I would say that the data is very noisy, so it is perhaps unsafe to make any claims.