Without data you're just another person with an opinion
W. Edwards Deming
I agree with the above quote 100%, but even with data it is possible to tell different stories, especially when the person doing the telling is economical with the details or context. For example:
Take casualties in the Vietnam War. We can plot the number of casualties by state:
CA clearly suffered the highest number of casualties; the visualisation makes that hard to argue with. But CA was at the time one of the most populous states in the USA, so we should not be surprised that more casualties came from CA. What happens if we get data on state populations from the 1960s and 70s and normalise the casualty numbers to get casualties per capita?
Now Missouri stands out as the state that suffered the highest casualty rate per capita, while CA had one of the lower rates. It is important to include all the details: total casualties and casualties per capita are not the same thing.
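The normalisation step itself is short in pandas. The sketch below uses two hypothetical states with made-up casualty and population figures (not the real data behind the plots), and the column names are my own choice, purely to show how the ranking can flip once you divide by population.

```python
import pandas as pd

# Made-up numbers for two hypothetical states; NOT real casualty or census data.
df = pd.DataFrame(
    {
        "state": ["State A", "State B"],
        "casualties": [5000, 1500],
        "population": [20_000_000, 4_000_000],
    }
).set_index("state")

# Normalise: casualties per 100,000 residents.
df["per_100k"] = df["casualties"] / df["population"] * 100_000

print(df.sort_values("per_100k", ascending=False))
print("Highest total:     ", df["casualties"].idxmax())
print("Highest per capita:", df["per_100k"].idxmax())
```

In this toy data State A has the larger total but State B has the higher rate, which is the same effect as CA versus Missouri above.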
Another example is the Olympic medal tables. If you assign points by medal type (gold = 3, silver = 2, bronze = 1), then the top countries in the 2016 Olympic Games include:
United States, Great Britain, China, Russia, France, Germany, Japan. This list is not surprising, as these countries are amongst the richest and most populous on the planet. If we repeat this but divide by country population we get: Bahamas, New Zealand, Jamaica, Bahrain, Fiji, Croatia, Armenia, Hungary, Denmark and Georgia, a very different view of the Olympic medal tables. So maybe the success of countries like the USA, the UK and China is due more to their population size and wealth than to sporting ability.
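The points scheme (gold = 3, silver = 2, bronze = 1) and the per-population version can be sketched as follows. The two countries, their medal counts and their populations are invented for illustration; only the weights come from the text.

```python
import pandas as pd

# Hypothetical countries with made-up medal counts and populations.
medals = pd.DataFrame(
    {
        "gold": [10, 1],
        "silver": [5, 1],
        "bronze": [5, 0],
        "population_millions": [100.0, 0.5],
    },
    index=["Country X", "Country Y"],
)

# Weights from the text: gold = 3, silver = 2, bronze = 1.
weights = {"gold": 3, "silver": 2, "bronze": 1}
medals["points"] = sum(medals[medal] * w for medal, w in weights.items())

# Same score, but per million inhabitants.
medals["points_per_million"] = medals["points"] / medals["population_millions"]

print(medals.sort_values("points", ascending=False)[["points"]])
print(medals.sort_values("points_per_million", ascending=False)[["points_per_million"]])
```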
According to Wolfram MathWorld:
A convenient definition of an outlier is a point which falls more than 1.5 times the interquartile range above the third quartile or below the first quartile
To illustrate this I'll use the Titanic passenger dataset, specifically its Age column. Plotting Age as a box plot gives:
So there are outliers at the top end of the data range: the open circles in the plot. If you find yourself in an interview for a data job, you might be asked how you would identify and remove outliers from data. This is how you can do it in pandas:
Identify outliers and remove them from the dataset
There are no lower outliers. The upper outliers are: 66, 65, 71, 70, 65, 65, 71, 66, 69, 80, 70, 70 and 74.
df_filtered is the dataset minus the outliers.
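For reference, here is a minimal sketch of the 1.5 × IQR rule in pandas. It uses a small invented list of ages rather than the Titanic file, so the numbers will not match those above, and the variable names other than `df_filtered` are my own.

```python
import pandas as pd

# Toy ages standing in for the Titanic Age column (values invented for illustration).
df = pd.DataFrame({"Age": [5, 18, 22, 25, 28, 30, 32, 35, 40, 45, 80]})

# First and third quartiles, and the interquartile range.
q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1

# Fences: 1.5 * IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

lower_outliers = df[df["Age"] < lower_fence]
upper_outliers = df[df["Age"] > upper_fence]

# Keep only rows inside the fences. Rows with a missing Age would also be
# dropped here, because between() evaluates to False for NaN.
df_filtered = df[df["Age"].between(lower_fence, upper_fence)]

print("fences:", lower_fence, upper_fence)
print("upper outliers:", upper_outliers["Age"].tolist())
```

With these toy values the quartiles are 23.5 and 37.5, so the fences are 2.5 and 58.5 and only the age of 80 is flagged.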
Correlation is a measure of the connection between variables. For example, the amount of leg room on a flight and the cost of the ticket may be correlated. This video explains the concept well:
You can try this yourself. For example, what makes people happy? The OECD measures life satisfaction and publishes the data here.
If we take three variables and measure correlation with life satisfaction this gives the following plot.
The darker the shade of grey, the stronger the correlation. The diagonal from top left to bottom right can be ignored: it compares each field with itself, so it always equals 1.
The strongest correlation is between life satisfaction and earnings, followed by education and finally personal time. Data displayed visually is often easier to read than the same data in a table; here the darkest off-diagonal cell makes the link between satisfaction and earnings obvious at a glance.
The function to generate this plot is:
Function to visualise a correlation matrix
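A function along these lines can be sketched with pandas and matplotlib. This is an assumption about how such a plot could be built, not necessarily the code used for the figure above; the function name `plot_correlation_matrix`, the column names and the sample values are all invented and are not the OECD figures.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_correlation_matrix(corr):
    """Draw |correlation| as a grey-scale grid; darker means stronger."""
    fig, ax = plt.subplots()
    image = ax.imshow(corr.abs().to_numpy(), cmap="Greys", vmin=0, vmax=1)
    ticks = list(range(len(corr.columns)))
    ax.set_xticks(ticks)
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(ticks)
    ax.set_yticklabels(corr.index)
    fig.colorbar(image, ax=ax, label="|Pearson correlation|")
    fig.tight_layout()
    return fig, ax


# Invented values for illustration only; these are not the OECD figures.
data = pd.DataFrame(
    {
        "life_satisfaction": [5.5, 6.0, 6.4, 7.0, 7.3, 7.5],
        "earnings": [22, 30, 35, 48, 52, 60],
        "education": [70, 85, 75, 90, 80, 92],
        "personal_time": [14.2, 15.1, 14.6, 15.0, 14.4, 15.3],
    }
)

corr = data.corr()  # pairwise Pearson correlation between columns
fig, ax = plot_correlation_matrix(corr)
```

Call `plt.show()` (or `fig.savefig(...)`) afterwards to see the result. Plotting the absolute value is a design choice here so that strong negative correlations also show up dark.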
In machine learning we want a model that fits the training data well but also works with data outside the training dataset. We want to minimise both underfitting and overfitting. Say we have a data feature y which depends on a set of other data features, x, such that y=f(x). We don't know what this relationship is, but we can use machine learning to build a model y=g(x) where g gives reasonable approximations to f. It is very unlikely that g(x) will give the same result as f(x) 100% of the time, because models are approximations. Cue the spherical chicken joke:
What is underfitting? According to this article:
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and over simplifies the model. It always leads to high error on training and test data.
Say we have some data, as illustrated by the leftmost plot above. The data contains two classes, red and blue. Maybe these are customers who renew and customers who don't, or patients who respond to some treatment and patients who don't. Our first attempt at a model (the middle plot) is a simple linear one. It obviously doesn't work, since there are too many data points of the wrong class on each side of the decision boundary; this is underfitting. Our second attempt (the rightmost plot) also doesn't work: it is an example of overfitting. Yes, it will give very good, maybe perfect, accuracy on the training data, but when you try to apply it to data outside the training set it will crash and burn. This video gives a nice overview of overfitting.
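You can see the same effect numerically. The sketch below swaps the two-class picture for a simpler regression problem: the "true" f is an arbitrary sine curve I picked for illustration, and g is a polynomial fitted with NumPy at three different degrees. Comparing error on the training points with error on held-out points shows how each model behaves.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def f(x):
    # The "unknown" true relationship, chosen arbitrarily for this sketch.
    return np.sin(2 * np.pi * x)


# Small noisy training set and a larger held-out test set from the same process.
x_train = np.linspace(0, 1, 10)
y_train = f(x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = rng.uniform(0, 1, 200)
y_test = f(x_test) + rng.normal(scale=0.2, size=x_test.size)

results = {}
for degree in (1, 3, 9):
    g = Polynomial.fit(x_train, y_train, degree)  # least-squares polynomial model
    train_mse = float(np.mean((g(x_train) - y_train) ** 2))
    test_mse = float(np.mean((g(x_test) - y_test) ** 2))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The straight line (degree 1) cannot follow the curve, so its training error stays high. The degree-9 polynomial passes through all ten training points, so its training error is essentially zero, yet it has fitted the noise as well as the signal, and its error on the held-out points is typically much worse than its training error suggests.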
Moral of the story ... if you're getting 100% during testing, you're probably overfitting...