One of the problems of data mining is trying to understand the reasons for patterns in the data. For example:
'Boris Bikes' (official name: Barclays Cycle Hire) is a bicycle hire scheme in London that started in July 2010. The number of bicycles hired per day is shown below:
Ignoring the first year or so of data (because the scheme was still being established), there appears to be a definite seasonal pattern: the peak is in the summer months and the trough is in the winter. This can be seen more clearly by looking at monthly rather than daily figures:
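The step from daily to monthly figures is just a group-and-sum. Here is a minimal sketch of it using only the standard library. The function name `monthly_totals` and the sample numbers are my own placeholders for illustration, not the real hire data or the script used for the chart.

```python
from collections import defaultdict
from datetime import date, timedelta


def monthly_totals(daily_hires):
    """Sum daily hire counts into (year, month) buckets."""
    totals = defaultdict(int)
    for day, hires in daily_hires.items():
        totals[(day.year, day.month)] += hires
    # Return the buckets in chronological order.
    return dict(sorted(totals.items()))


# Made-up placeholder figures (NOT the real hire data): 60 consecutive
# days starting on 1 January 2012, with the count rising by one each day.
start = date(2012, 1, 1)
daily_hires = {start + timedelta(days=i): 100 + i for i in range(60)}

monthly = monthly_totals(daily_hires)
for (year, month), total in monthly.items():
    print(f"{year}-{month:02d}: {total}")
```

With the real daily counts loaded into a similar date-to-count mapping, the same function would give the monthly series, which smooths out day-to-day noise such as weather and weekends.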
Visitor numbers to London also show this kind of seasonal pattern:
The visitor data was only available per quarter, but the seasonal pattern still shows up. So there may be a link between tourists visiting London and the number of bicycles being hired; in other words, tourists might account for a significant share of the total hires. That looks like a reasonable conclusion based on the data: without the tourists, the number of cycles hired out could be significantly lower.
However, the data for cycle flows on the Transport for London Road Network, which covers all cyclists and not just those on the hire scheme, shows the same seasonal pattern, including in the period before the hire scheme became strongly established.
So maybe the true picture is simply that more people cycle in London in the summer than in the winter, and that more people also visit London during the summer. The only way to determine whether tourists make a significant contribution to the number of 'Boris Bikes' being hired out is to get data on the people who are doing the hiring, and that data is not included in the available dataset.
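To illustrate the point, here is a small sketch in which two series are each driven by the season and by nothing else, yet correlate almost perfectly with each other. All the numbers are synthetic and chosen purely for illustration; they are not taken from the hire, visitor or cycle-flow data, and the `pearson` helper is just a hand-written correlation coefficient.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic seasonal factor for 12 months: lowest in month 0, highest in month 6.
season = [math.sin(2 * math.pi * (m - 3) / 12) for m in range(12)]

# Two made-up series that each depend ONLY on the season, not on each other.
cycling = [1000 + 400 * s for s in season]   # hypothetical hires per day
visitors = [50 + 10 * s for s in season]     # hypothetical visitors (thousands)

r = pearson(cycling, visitors)
print(f"correlation between cycling and visitors: {r:.3f}")
```

The correlation comes out very close to 1 even though, by construction, neither series has any effect on the other. A shared driver such as the time of year is enough to produce it.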
The moral of the story: don't assume that correlation implies causation. The following site has numerous (and humorous) examples of apparent correlations between completely unrelated things.
All the data used above is available here.