This article illustrates the advantages of combining datasets to answer questions. It also includes an example of using web scraping to get data.
The question I want to investigate is: are countries that were among the first to give women the vote more open to doing business with other countries? Put another way, I want to test the hypothesis that progressive countries are more open to doing business with foreigners. So I will test whether there is any correlation between the year women were given the vote and how easy it is for other countries to do business in that country.
Step One, get the data:
The data is not available as a ready-to-download CSV file, so I scraped it from Wikipedia using a combination of the Requests library and Beautiful Soup:
The script uses Requests to get the HTML and creates a Beautiful Soup object from it. It then iterates through that object with nested for statements to get the table, then each row, and finally each element within each row. The script puts this data into a list of lists and finally converts that to a pandas dataframe.
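The steps described above could be sketched roughly as follows. This is not the original script: the function names, the column names and the tiny sample page are all assumptions made for illustration, and a real Wikipedia table would need extra cleaning.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def table_to_rows(html):
    """Walk table -> row -> cell and collect the cell text as a list of lists."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table"):
        for tr in table.find_all("tr"):
            cells = []
            for td in tr.find_all("td"):
                cells.append(td.get_text(strip=True))
            if cells:  # header rows made only of <th> cells come back empty
                rows.append(cells)
    return rows


# Hypothetical stand-in page so the sketch runs without a network connection.
# For a live page you would call fetch_html(url) instead.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Year</th></tr>
  <tr><td>Country A</td><td>1900</td></tr>
  <tr><td>Country B</td><td>1950</td></tr>
</table>
"""

rows = table_to_rows(SAMPLE_HTML)
df_vote = pd.DataFrame(rows, columns=["country", "year"])
df_vote["year"] = df_vote["year"].astype(int)  # scraped text arrives as strings
print(df_vote)
```

Keeping the download and the parsing in separate functions makes it possible to try the parsing on a saved copy of the page before hitting the live site.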
Step Two, combining data: this dataframe just contains the country name and the year women got the vote. I then did the same thing to get country GDP and ease of doing business. The reason I converted the list of lists to a pandas dataframe is that pandas makes it easy to combine dataframes. They can be merged in a similar way to joining tables in an SQL query, for example:
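A minimal sketch of such a merge, assuming both dataframes share a country column. The dataframe names, column names and values here are invented for illustration.

```python
import pandas as pd

# Hypothetical data: one dataframe per scraped table.
df_vote = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "year": [1900, 1950, 1970],
})
df_business = pd.DataFrame({
    "country": ["Country A", "Country B", "Country D"],
    "ease_of_business": ["easy", "below average", "easy"],
})

# An inner merge keeps only countries present in both dataframes,
# much like an INNER JOIN on the country column in SQL.
merged = pd.merge(df_vote, df_business, on="country", how="inner")
print(merged)
```

Country names are often spelled differently from one source to the next, so it is worth checking which rows were dropped by the join.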
Once I had the merged dataframe I created a correlation matrix visualisation:
If there is a correlation between the year women were given the vote and the ease of doing business with that country then I expect it to be negative, because a negative correlation implies that countries that are open to doing business with foreigners gave women the vote earlier than more closed countries. The above diagram shows there is some correlation between the variables year and eob_num (ease of business converted from a phrase like easy or below average to a numerical value). The exact correlation value can be obtained with just one line of code:
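As a sketch of that step, the snippet below converts the ease of business phrases to numbers and then asks pandas for the correlation. The phrase-to-number mapping and the data are assumptions made up for this example, not the values used in the analysis.

```python
import pandas as pd

# Invented toy data, for illustration only.
df = pd.DataFrame({
    "year": [1900, 1920, 1945, 1960, 1975],
    "ease_of_business": ["easy", "easy", "average", "below average", "below average"],
})

# Hypothetical mapping: a higher number means it is easier to do business.
EOB_SCORES = {"below average": 1, "average": 2, "easy": 3}
df["eob_num"] = df["ease_of_business"].map(EOB_SCORES)

# The one line that gives the correlation between two columns.
r = df["year"].corr(df["eob_num"])
print(round(r, 2))
```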
The correlation values are:
year - ease of doing business = -0.46
year - GDP = -0.20
ease of doing business - GDP = 0.25
Summary: the correlation values are not very strong but the signs are as expected: negative for the first two and positive for the third. There are factors that I did not include in this analysis. For example, some countries did not gain independence until quite recently, so would not have been in a position to give women the vote until after independence. There are also countries where people have a vote but that vote is meaningless because the government is not democratic or the voting system is corrupt.
According to GOV.UK, in England, Wales and Northern Ireland
The police can stop and question you at any time - they can search you depending on the situation.
The rules are different in Scotland.
I wanted to test for any correlation between the number of people who were stopped and searched and the number of people subsequently arrested. If stop and search is effective at catching criminals then I would expect to see some degree of positive correlation. In this case the data is buried in PDF files, probably the worst place to get data from. There are a few Python libraries that can help when extracting data from PDFs, for example pdfminer and the library I used, Tabula. I chose Tabula because it is relatively easy to use when extracting tables from PDF files. The code looks like this:
The read_pdf function returns a pandas dataframe. You need to pass in the file path and the page or pages to read. The library worked for me but that is not always the case - PDFs are not easy to work with.
The data is arranged by financial year rather than calendar year so fy_12_13 = April 2012 to March 2013
Plotting the numbers arrested against the numbers stopped for the six years gives
There does not appear to be a strong correlation. Pandas can give the correlation coefficient:
This gives a value of 0.3589143729477498, so there is not much correlation. This implies that increasing the number of people who are stopped and searched would not necessarily result in more arrests.
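By default the pandas corr method computes Pearson's correlation coefficient. To show what that number measures, here is a small pure-Python sketch of the same calculation. The six pairs of values are made up and are not the stop and search figures.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r: covariance scaled by the spread of each variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented numbers, one pair per year, for illustration only.
stopped = [10, 20, 30, 40, 50, 60]
arrested = [3, 1, 4, 2, 6, 5]

r = pearson_r(stopped, arrested)
print(round(r, 3))
```

A value near +1 or -1 means the points lie close to a straight line, while a value near 0 means there is little linear relationship.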
Without data you're just another person with an opinion
W. Edwards Deming
I agree with the above quote 100%, but even with data it is possible to tell different stories, especially when the person doing the telling is economical with the details or context. For example:
When looking at casualties in the Vietnam war, we can plot the number of casualties by state:
CA obviously suffered the highest number of casualties; no one can argue with that, and the visualisation clearly demonstrates it. But CA at the time was one of the most populous states in the USA, so we might not be surprised to discover that more casualties came from CA. What happens if we get data on state populations from the 1960s and 70s and normalise the casualty numbers to get casualties per capita?
Now Missouri stands out as the state that suffered the highest casualty rate per capita, while CA had one of the lower casualty rates. It is important to include all details: total casualties and casualties per capita are not the same thing.
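The normalisation itself is a one-line division. The sketch below uses invented state names and numbers (not the real casualty or population figures) to show how the ranking can flip once population is taken into account.

```python
import pandas as pd

# Invented figures, for illustration only.
df = pd.DataFrame({
    "state": ["State X", "State Y", "State Z"],
    "casualties": [5000, 1500, 400],
    "population": [20_000_000, 4_000_000, 800_000],
})

# Casualties per 100,000 residents.
df["per_100k"] = df["casualties"] / df["population"] * 100_000

by_total = df.sort_values("casualties", ascending=False)["state"].tolist()
by_rate = df.sort_values("per_100k", ascending=False)["state"].tolist()

print("Ranked by total:", by_total)
print("Ranked by rate: ", by_rate)
```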
Another example is the Olympic medal tables. If you assign points to medal type, gold = 3, silver = 2 and bronze = 1 then the top countries in the 2016 Olympic games include:
United States, Great Britain, China, Russia, France, Germany, Japan - this list is not surprising as these countries are amongst the richest and most populous countries on the planet. If we repeat this but this time divide by country population we get: Bahamas, New Zealand, Jamaica, Bahrain, Fiji, Croatia, Armenia, Hungary, Denmark and Georgia, a very different view of the Olympic medal tables. So maybe the success of countries like the USA, the UK and China is due more to their population size and wealth than to sporting ability.
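The points scheme above (gold = 3, silver = 2, bronze = 1) and the per-population version can be sketched like this. The two teams and their medal counts are invented, so the output says nothing about the real 2016 table.

```python
MEDAL_POINTS = {"gold": 3, "silver": 2, "bronze": 1}


def medal_points(medals):
    """Total points for a dict such as {"gold": 1, "silver": 0, "bronze": 2}."""
    return sum(MEDAL_POINTS[kind] * count for kind, count in medals.items())


# Hypothetical teams, for illustration only.
teams = {
    "Big Nation": {
        "medals": {"gold": 40, "silver": 30, "bronze": 30},
        "population": 300_000_000,
    },
    "Small Nation": {
        "medals": {"gold": 1, "silver": 0, "bronze": 1},
        "population": 400_000,
    },
}

points = {name: medal_points(t["medals"]) for name, t in teams.items()}
points_per_million = {
    name: points[name] / (t["population"] / 1_000_000) for name, t in teams.items()
}

top_by_points = max(points, key=points.get)
top_per_million = max(points_per_million, key=points_per_million.get)
print(top_by_points, top_per_million)
```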
According to Wolfram Mathworld:
A convenient definition of an outlier is a point which falls more than 1.5 times the interquartile range above the third quartile or below the first quartile
To illustrate this I'll use the Titanic passenger dataset, which has an Age column. Plotting Age as a box plot gives:
So there are outliers at the top end of the data range, shown as open circles in the plot. If you find yourself in an interview for a data job you might be asked how you would identify and remove outliers from data. This is how you can do it in pandas:
Identify outliers and remove them from the dataset
There are no lower outliers. The upper outliers are: 66, 65, 71, 70, 65, 65, 71, 66, 69, 80, 70, 70 and 74.
df_filtered is the dataset minus the outliers.
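A minimal sketch of the 1.5 times interquartile range rule in pandas. The ten ages below are invented stand-ins for the Titanic Age column, so the outliers found here differ from the list above.

```python
import pandas as pd

# Invented ages standing in for the Titanic Age column.
df = pd.DataFrame({"Age": [20, 22, 24, 26, 28, 30, 32, 34, 36, 80]})

q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles counts as an outlier.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

is_outlier = (df["Age"] < lower_bound) | (df["Age"] > upper_bound)
outliers = df.loc[is_outlier, "Age"]
df_filtered = df[~is_outlier]

print(outliers.tolist())
print(len(df_filtered))
```

Rows with a missing Age are not flagged by either comparison, so they stay in df_filtered unless you drop them separately.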
Correlation is a measure of the connection between variables, for example the amount of leg room on a flight and the cost of the ticket. This video explains the concept well:
You can try this yourself. For example, what makes people happy? The OECD measures life satisfaction and publishes the data, here.
If we take three variables and measure correlation with life satisfaction this gives the following plot.
The darker the shade of grey the stronger the correlation. The top left to bottom right diagonal can be ignored: it compares each field with itself, so it always equals 1.
The plot suggests the strongest correlation is between life satisfaction and earnings, followed to a lesser extent by education and finally personal time. Displaying data visually is often easier to read than displaying it as a table.
The function to generate this plot is:
Function to visualise a correlation matrix
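One way such a function could look, using matplotlib. This is a sketch rather than the original function: the function name, the choice to shade by absolute correlation and the toy data are all assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_correlation_matrix(df, cmap="Greys"):
    """Draw the correlation matrix of the numeric columns as a shaded grid."""
    corr = df.select_dtypes("number").corr()
    labels = list(corr.columns)

    fig, ax = plt.subplots()
    # Shade by absolute value so strong negative correlations are dark too.
    image = ax.imshow(corr.abs(), cmap=cmap, vmin=0, vmax=1)
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    fig.colorbar(image, ax=ax)
    fig.tight_layout()
    return fig, corr


# Invented values, for illustration only (not the OECD figures).
df_oecd = pd.DataFrame({
    "satisfaction": [5.0, 6.0, 6.5, 7.0, 7.5],
    "earnings": [20, 30, 35, 45, 50],
    "education": [60, 70, 65, 80, 85],
    "personal_time": [14.0, 15.0, 14.5, 15.5, 15.0],
})

fig, corr = plot_correlation_matrix(df_oecd)
print(corr.round(2))
```

Call plt.show() to display the figure, or fig.savefig with a file name to write it to disk.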
In machine learning we want a model that fits the training data well but can also work with data outside the training dataset. We want to minimize underfitting and overfitting. Say we have a data feature y which depends on a set of other data features, x, such that y=f(x). We don't know what this relationship is but we can use machine learning to build a model y=g(x) where g gives reasonable approximations to f. It is very unlikely that g(x) will give the same result as f(x) 100% of the time because models are approximations. Cue the spherical chicken joke:
What is underfitting? - according to this article
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and over simplifies the model. It always leads to high error on training and test data.
Say we have some data, as illustrated by the leftmost plot above. The data contains two classes, red and blue. Maybe these are customers who renew and customers who don't, or patients who respond to some treatment and patients who don't. Our first attempt at a model (the middle plot) is a simple linear one. It obviously doesn't work, since there are too many data points of the wrong class on each side of the decision boundary: it is an example of underfitting. Our second attempt (the rightmost plot) also doesn't work - it is an example of overfitting. Yes, it will give very good, maybe perfect, accuracy on the training data, but when you try to apply it to data outside the training set it will crash and burn. This video gives a nice overview of overfitting.
Moral of the story ... if you're getting 100% during testing, you're probably overfitting...
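The same effect can be seen numerically. The sketch below swaps the classification example for a simpler regression one: it fits polynomials of increasing degree to a few noisy points and compares the error on the training points with the error on fresh points. The underlying function, the noise level and the degrees are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_f(x):
    """The hidden relationship y = f(x) that the model tries to approximate."""
    return np.sin(2 * np.pi * x)


# Ten noisy training points and ten separate noisy test points.
x_train = np.linspace(0, 1, 10)
y_train = true_f(x_train) + rng.normal(0, 0.2, size=x_train.size)
x_test = np.linspace(0.05, 0.95, 10)
y_test = true_f(x_test) + rng.normal(0, 0.2, size=x_test.size)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```

The training error can only go down as the degree goes up, and the degree 9 polynomial passes through every training point. What matters is the test error, which typically stops improving and then gets worse once the model starts fitting the noise.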
Benford's law describes the frequency of the initial digits in datasets of numbers where the numbers span several orders of magnitude. It should be visible if you plot country populations or land areas, and it can also show up in accounting. In the past Greece's official economic data has shown the greatest divergence from what Benford's law predicts, at least within the EU (see here).
I tried this myself, creating a dataset of country populations from 2017. I then used the following code to plot the data:
plot value counts of initial digits for country populations
The digit 1 accounts for around 30% of the values, as expected, but some digits are out of sequence: 7 comes last where 9 should be, and 3 and 4 are also out of order. The overall shape is roughly as expected.
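Benford's law predicts that a leading digit d appears with frequency log10(1 + 1/d). The sketch below compares that prediction with observed leading-digit frequencies. It uses the first 1000 powers of two as a stand-in dataset, since they are known to follow the law closely, rather than the population data; the function names are made up for this example.

```python
from collections import Counter
from math import log10


def leading_digit(n):
    """First digit of a non-zero integer."""
    return int(str(abs(n))[0])


def digit_frequencies(values):
    """Share of values starting with each digit from 1 to 9."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


def benford_expected(d):
    """Frequency that Benford's law predicts for leading digit d."""
    return log10(1 + 1 / d)


# Stand-in dataset spanning many orders of magnitude.
values = [2 ** n for n in range(1000)]
observed = digit_frequencies(values)

for d in range(1, 10):
    print(d, round(observed[d], 3), round(benford_expected(d), 3))
```

To use it on population data, pass in the list of integer populations instead of the powers of two.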
You can get population data from places like the U.N.
According to the font of all truthful and accurate knowledge - Wikipedia - "Bayesian probability is an interpretation of the concept of probability, in which, instead of frequency or propensity of some phenomenon, probability is interpreted as reasonable expectation representing a state of knowledge or as quantification of a personal belief."
So what does that mean? The YouTube video below gives a good visual explanation.
Here is another explanation:
According to Wikipedia dizygotic (fraternal) twins usually occur when two fertilized eggs are implanted in the uterus wall at the same time while monozygotic (identical) twins occur when a single egg is fertilized to form one zygote (hence, "monozygotic") which then divides into two separate embryos.
Fraternal twins can be mm, mf, fm or ff (where m = male and f = female), identical twins can only be mm, or ff.
For the sake of this example let's say the probability of each option is equal, so P(mm) = P(mf) = P(fm) = P(ff) = 0.25 for Fraternal twins and P(mm) = P(ff) = 0.5 for identical twins. The probability that twins are identical is P(I) = 0.1 so P(F) = 0.9 (probability of Fraternal), assuming twins must be either identical or fraternal (not strictly true but let's not make things too complicated).
If we have two brothers who are twins what is the probability that they are identical twins?
The non-Bayesian answer might be 0.1 or 10% because I said above that 10% of twins are identical. However the Bayesian approach gives a different answer:
The probability of identical twins given that both twins are brothers is written as P(I|B) = P(B|I)P(I)/P(B)
and since we are assuming twins must be either identical or fraternal: P(B) = P(B|I)P(I) + P(B|F)P(F)
Substituting this into the above gives: P(I|B) = P(B|I)P(I)/(P(B|I)P(I) + P(B|F)P(F))
then putting in the numbers gives (0.5 x 0.1)/((0.5 x 0.1) + (0.25 x 0.9)) = 2/11 (about 18.2%) - so the knowledge that both twins are male makes the probability they are identical higher.
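The arithmetic can be checked with exact fractions. The variable names below are made up for this sketch; the probabilities are the ones assumed in the example above.

```python
from fractions import Fraction

# Priors assumed in the example.
p_identical = Fraction(1, 10)
p_fraternal = 1 - p_identical

# Probability that both twins are boys under each hypothesis.
p_brothers_given_identical = Fraction(1, 2)
p_brothers_given_fraternal = Fraction(1, 4)

# Total probability of two brothers: P(B) = P(B|I)P(I) + P(B|F)P(F)
p_brothers = (p_brothers_given_identical * p_identical
              + p_brothers_given_fraternal * p_fraternal)

# Bayes' theorem: P(I|B) = P(B|I)P(I) / P(B)
posterior = p_brothers_given_identical * p_identical / p_brothers

print(posterior, float(posterior))
```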
The video below makes a good point about the advantages and at least one disadvantage of a Bayesian approach.
The term regression originated with a 19th century English guy. His name was Galton and he loved to measure stuff, for example he measured the height of people who had tall parents and found that their average height was less than the parents' average height. He called this regression to the mean. The name 'regression' stuck.
In machine learning you'll come across two common algorithms that include the term regression in their title.
Linear Regression Example
Linear regression is all about predicting numerical values, for example the number of customers in a restaurant on a given day, the price of some commodity or, in the example below, the maximum temperature for a given minimum temperature. Using a dataset of weather observations recorded during the second world war we can use linear regression to build a predictive model. The dataset contains min and max temperatures for each day, which we can plot:
There is some scatter but the plot is quite linear. So this seems to be a good case for using linear regression. Linear regression has the following relationship between the input x and the output y:
y = mx + b, where m is the gradient of the line and b is the intercept
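A minimal sketch of fitting m and b by least squares with NumPy. The temperatures are invented and deliberately lie exactly on a line, so this is not the weather dataset and real data would show scatter around the fitted line.

```python
import numpy as np

# Invented, noiseless data: max_temp = 1.5 * min_temp + 8
min_temp = np.array([10.0, 12.0, 15.0, 18.0, 20.0])
max_temp = 1.5 * min_temp + 8.0

# A degree 1 polynomial fit returns the gradient m and the intercept b.
m, b = np.polyfit(min_temp, max_temp, 1)

# Predict the maximum temperature for a new minimum temperature.
predicted_max = m * 16.0 + b
print(round(m, 2), round(b, 2), round(predicted_max, 2))
```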
Linear regression is all about predicting a numerical value. Logistic regression however is about predicting which class something belongs to. In the example below I use a list of Titanic passengers to classify which passengers survived and which died. The code uses two thirds of the rows as training data then attempts to predict the Survived column value for the remaining one third.
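A rough sketch of that workflow with scikit-learn, holding back one third of the rows for testing. It uses randomly generated features and labels as a stand-in for the Titanic passenger list, so the variable names and the accuracy it prints are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_rows = 300

# Hypothetical stand-in data: two numeric features and a 0/1 label.
X = rng.normal(size=(n_rows, 2))
noise = rng.normal(scale=0.5, size=n_rows)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

# Two thirds of the rows for training, one third held back for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)   # predicted class for each held-back row
accuracy = model.score(X_test, y_test)
print(f"accuracy on held-back rows: {accuracy:.2f}")
```

With the real data, X would hold passenger features converted to numbers and y would be the Survived column.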
Linear regression is used to predict numerical values, and it can be extended to include non-linear regression (for example, see here). Logistic regression is used in classification problems; real world examples could include classifying customers into categories or classifying network activity as benign or suspicious.