This article illustrates the advantages of combining datasets to answer questions. It also includes an example of using web scraping to get data.
The question I want to investigate is: are countries that were among the first to give women the vote more open to doing business with other countries? Put another way, I want to test the hypothesis that progressive countries are more open to doing business with foreigners. So I will test whether there is any correlation between the year women were given the vote and how easy it is for other countries to do business in that country.
Step One, get the data:
The data is not available as a ready-to-download CSV file, so I scraped it from Wikipedia. I used a combination of the Requests library and Beautiful Soup:
The script uses Requests to get the HTML. I then create a BeautifulSoup object from the HTML and iterate through it using nested for statements to get the table, then each row, and finally each element within each row. The script puts this data into a list of lists, then converts that to a pandas dataframe.
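The original script is not reproduced here, so the following is only a minimal sketch of the steps just described. The function names fetch_html and tables_to_rows, the column names, and the small stand-in HTML table are assumptions made for illustration; the real Wikipedia page has a different layout and would need its own clean-up.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page with Requests and return its HTML text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def tables_to_rows(html):
    """Walk each table, then each row, then each cell, building a list of lists."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table"):
        for tr in table.find_all("tr"):
            cells = []
            for cell in tr.find_all(["th", "td"]):
                cells.append(cell.get_text(strip=True))
            if cells:
                rows.append(cells)
    return rows


# Stand-in HTML so the sketch runs offline; in practice pass fetch_html(url) instead.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Year</th></tr>
  <tr><td>New Zealand</td><td>1893</td></tr>
  <tr><td>Finland</td><td>1906</td></tr>
</table>
"""

rows = tables_to_rows(SAMPLE_HTML)

# The first row holds the headers; the remaining rows become the dataframe.
suffrage = pd.DataFrame(rows[1:], columns=["country", "year"])
suffrage["year"] = suffrage["year"].astype(int)
print(suffrage)
```

Keeping the download and the parsing in separate functions means the parsing can be checked against a saved or hand-written HTML snippet without hitting the network each time.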
Step Two, combining data: this dataframe contains just the country name and the year women got the vote. I then did the same thing to get country GDP and ease of doing business. The reason I converted the list of lists to a pandas dataframe is that pandas makes it easy to combine dataframes. They can be merged in a similar way to joining tables in an SQL query, for example:
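As a rough illustration of that kind of merge, here is a sketch using three tiny dataframes. The frame names, column names, and all values are placeholders invented for the example, not the scraped data, and the choice of an inner join is an assumption.

```python
import pandas as pd

# Placeholder rows for illustration only; the real frames come from the scraped tables.
suffrage = pd.DataFrame({"country": ["A", "B", "C"], "year": [1900, 1920, 1940]})
business = pd.DataFrame({"country": ["A", "B", "D"], "eob": ["easy", "below average", "easy"]})
gdp = pd.DataFrame({"country": ["A", "B", "C"], "gdp": [30, 20, 10]})

# An inner merge keeps only countries present in both frames, like an SQL INNER JOIN.
merged = suffrage.merge(business, on="country", how="inner")
merged = merged.merge(gdp, on="country", how="inner")
print(merged)
```

Countries C and D drop out because each is missing from one of the tables. With scraped data it is worth checking how many rows survive the merge, since differently spelled country names will silently fail to match.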
Once I had the merged dataframe I created a correlation matrix visualisation:
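The plotting code is not shown here, so this is just one possible way to draw such a matrix, assuming matplotlib is used. The dataframe values are placeholders for illustration, and the column names year, eob_num and gdp are assumed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values for illustration only, not the scraped data.
merged = pd.DataFrame({
    "year": [1900, 1920, 1940, 1960, 1980],
    "eob_num": [5, 4, 4, 2, 1],
    "gdp": [50, 40, 45, 20, 10],
})

# Pairwise Pearson correlations between the numeric columns.
corr = merged.corr()

# Draw the matrix as a coloured grid, with the scale fixed to the range -1 to 1.
fig, ax = plt.subplots()
image = ax.matshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(image)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
# Call fig.savefig("correlation_matrix.png") to write the picture to disk.
```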
If there is a correlation between the year women were given the vote and the ease of doing business with that country, then I expect it to be negative, because a negative correlation implies that countries that are open to doing business with foreigners gave women the vote earlier than more closed countries. The above diagram shows there is some correlation between the variables year and eob_num (ease of doing business converted from a phrase like "easy" or "below average" to a numerical value). The exact correlation value can be obtained with just one line of code.
The correlation values are:
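A sketch of both the phrase-to-number conversion and the one-line correlation follows. The mapping in EOB_SCALE is hypothetical (the scale actually used is not given), and the rows are placeholders chosen so the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical mapping from rating phrase to number; higher means easier.
EOB_SCALE = {"very easy": 5, "easy": 4, "average": 3, "below average": 2, "difficult": 1}

# Placeholder rows for illustration only, not the scraped data.
merged = pd.DataFrame({
    "year": [1900, 1920, 1940, 1960],
    "eob": ["very easy", "easy", "average", "below average"],
})
merged["eob_num"] = merged["eob"].map(EOB_SCALE)

# The one line: Pearson correlation between two columns.
r = merged["year"].corr(merged["eob_num"])
print(r)
```

In this toy data the rating falls by one step for every 20 years, so the correlation is a perfect negative one. Calling merged.corr(numeric_only=True) instead would return every pairwise value at once.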
year - ease of doing business = -0.46
year - GDP = -0.20
ease of doing business - GDP = 0.25
Summary: the correlation values are not very strong, but the signs are as expected: negative for the first two and positive for the third. There are factors that I did not include in this analysis. For example, some countries did not gain independence until quite recently, so they would not have been in a position to give women the vote until after independence. There are also countries where people have a vote but that vote is meaningless because the government is not democratic or the voting system is corrupt.