Without data you're just another person with an opinion
W. Edwards Deming
I agree with the above quote 100%, but even with data it is possible to tell different stories, especially when the person doing the telling is economical with the details or context. For example:
When looking at casualties in the Vietnam War, we can plot the number of casualties by state:
CA clearly suffered the highest number of casualties; the visualisation demonstrates this and no one can argue with it. But CA at the time was one of the most populous states in the USA, so we should not be surprised to discover that more casualties came from CA. What happens if we get data on state populations from the 1960s and 70s and normalise the casualty numbers to get casualties per capita?
Now Missouri stands out as the state that suffered the highest casualty rate per capita, while CA had one of the lower rates. It is important to include all the details: total casualties and casualties per capita are not the same thing.
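To make the normalisation step concrete, here is a minimal pandas sketch. The three states and all of their figures are made up purely for illustration (they are not real casualty or census numbers), and the column names are my own assumptions.

```python
import pandas as pd

# Hypothetical figures for three made-up states; NOT real casualty or census data.
df = pd.DataFrame({
    "state": ["A", "B", "C"],
    "casualties": [5000, 1200, 300],
    "population": [20_000_000, 4_000_000, 600_000],
})

# Normalise: casualties per 100,000 residents.
df["per_100k"] = df["casualties"] / df["population"] * 100_000

# The same data ranked two ways tells two different stories.
by_total = df.sort_values("casualties", ascending=False)["state"].tolist()
by_rate = df.sort_values("per_100k", ascending=False)["state"].tolist()

print("Ranked by total:", by_total)
print("Ranked by rate: ", by_rate)
```

In this toy data the state with the most casualties in total ends up last once population is taken into account, which is the same kind of reversal described above.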
Another example is the Olympic medal tables. If you assign points to medal type, gold = 3, silver = 2 and bronze = 1 then the top countries in the 2016 Olympic games include:
United States, Great Britain, China, Russia, France, Germany, Japan. This list is not surprising, as these countries are amongst the richest and most populous on the planet. If we repeat this but this time divide by country population we get: Bahamas, New Zealand, Jamaica, Bahrain, Fiji, Croatia, Armenia, Hungary, Denmark and Georgia, a very different view of the Olympic medal tables. So maybe the success of countries like the USA, the UK and China is due more to their population size and wealth than to sporting ability.
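The scoring scheme above (gold = 3, silver = 2, bronze = 1) followed by a division by population could be sketched like this. The two countries and their medal counts are invented for illustration only; they are not taken from the 2016 results.

```python
import pandas as pd

# Weights from the text: gold = 3, silver = 2, bronze = 1.
POINTS = {"gold": 3, "silver": 2, "bronze": 1}

# Hypothetical countries and medal counts, NOT real 2016 results.
medals = pd.DataFrame({
    "country": ["Large", "Small"],
    "gold": [40, 2],
    "silver": [30, 1],
    "bronze": [30, 1],
    "population": [300_000_000, 1_000_000],
})

# Weighted medal points per country.
medals["points"] = sum(medals[medal] * weight for medal, weight in POINTS.items())

# Points per million inhabitants.
medals["points_per_million"] = medals["points"] / (medals["population"] / 1_000_000)

print(medals[["country", "points", "points_per_million"]])
```

The large country wins comfortably on raw points, but the small one comes out well ahead per million inhabitants.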
According to Wolfram Mathworld:
A convenient definition of an outlier is a point which falls more than 1.5 times the interquartile range above the third quartile or below the first quartile
To illustrate this I'll use the Titanic passenger dataset, which has an Age column. Plotting Age as a box plot gives:
So there are outliers at the top end of the data range: the open circles in the plot. If you find yourself in an interview for a data job you might be asked how you would identify and remove outliers from data. This is how you can do it in pandas:
Identify outliers and remove them from the dataset
There are no lower outliers. The upper outliers are: 66, 65, 71, 70, 65, 65, 71, 66, 69.0, 80, 70, 70 and 74.
df_filtered is the dataset minus the outliers.
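For reference, the 1.5 x IQR rule quoted above can be sketched in pandas roughly as follows. This is a minimal illustration rather than the exact code used on the Titanic data: the ages are a small made-up sample, and only the names `Age` and `df_filtered` come from the text.

```python
import pandas as pd

# Small made-up sample of ages standing in for the Titanic Age column.
df = pd.DataFrame({"Age": [22, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 35, 36, 71, 80]})

# First and third quartiles and the interquartile range.
q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles counts as an outlier.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

is_outlier = (df["Age"] < lower_bound) | (df["Age"] > upper_bound)
outliers = df.loc[is_outlier, "Age"]

# Keep only the rows that are not outliers.
df_filtered = df[~is_outlier]

print("Bounds:", lower_bound, upper_bound)
print("Outliers:", outliers.tolist())
```

One thing to be aware of: comparisons with missing values evaluate to False, so rows with a missing Age are not flagged as outliers and stay in `df_filtered` unless you drop them separately.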
Correlation is a measure of the connection between variables. For example the amount of leg room on a flight and the cost of the ticket. This video explains the concept well:
You can try this yourself. For example, what makes people happy? The OECD measures life satisfaction and publishes the data, here.
If we take three variables and measure correlation with life satisfaction this gives the following plot.
The darker the shade of grey, the stronger the correlation. The diagonal from top left to bottom right can be ignored: it compares each field with itself, so the correlation is always 1.
The strongest correlation is between life satisfaction and earnings, followed by education and finally personal time. Displaying data visually often makes it easier to read than displaying it as a table.
The function to generate this plot is:
Function to visualise a correlation matrix
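A function along these lines could look like the sketch below. It is an assumption about how such a plot might be produced, not the original function: it uses pandas' `DataFrame.corr` and matplotlib's `imshow` with the `Greys` colour map, plots the absolute value so strong negative correlations also appear dark, and runs on a handful of made-up numbers rather than the OECD data.

```python
import pandas as pd


def plot_correlation_matrix(corr, cmap="Greys"):
    """Draw a correlation matrix as a grid of grey squares (darker = stronger)."""
    # Imported here so the correlation step works without a plotting backend.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(corr.abs().to_numpy(), cmap=cmap, vmin=0, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    fig.colorbar(image, ax=ax)
    return ax


# Made-up values for illustration only; NOT the OECD figures.
data = pd.DataFrame({
    "satisfaction": [5.0, 6.0, 6.5, 7.0, 7.5],
    "earnings": [20, 30, 35, 45, 50],
    "personal_time": [15.5, 14.0, 15.0, 14.5, 15.2],
})

# Pairwise Pearson correlation between every pair of columns.
corr = data.corr()
print(corr.round(2))
```

Calling `plot_correlation_matrix(corr)` and then `matplotlib.pyplot.show()` draws the grid.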
In machine learning we want a model that fits the training data well but also works with data outside the training dataset. We want to minimise both underfitting and overfitting. Say we have a data feature y which depends on a set of other data features, x, such that y=f(x). We don't know what this relationship is, but we can use machine learning to build a model y=g(x) where g gives reasonable approximations to f. It is very unlikely that g(x) will give the same result as f(x) 100% of the time because models are approximations. Cue the spherical chicken joke:
What is underfitting? - according to this article
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and over simplifies the model. It always leads to high error on training and test data.
Say we have some data, as illustrated by the leftmost plot above. The data contains two classes, red and blue. Maybe these are customers who renew and customers who don't, or patients who respond to some treatment and patients who don't. Our first attempt at a model (the middle plot) is a simple linear one. It obviously doesn't work, since there are too many data points of the wrong class on each side of the decision boundary. Our second attempt (the rightmost plot) also doesn't work: it is an example of overfitting. Yes, it will give very good, maybe perfect, accuracy on the training data, but when you try to apply it to data outside the training set it will crash and burn. This video gives a nice overview of overfitting.
Moral of the story ... if you're getting 100% during testing, you're probably overfitting...
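You can see the same effect with a few lines of NumPy. Note that this sketch swaps the two-class example above for a simpler curve-fitting one: it fits polynomials of increasing degree to ten noisy points drawn from a sine curve (my own choice of toy data) and compares the error on the training points with the error on fresh points.

```python
import numpy as np

rng = np.random.default_rng(42)


def true_function(x):
    # The unknown f(x) that the model is trying to approximate.
    return np.sin(2 * np.pi * x)


# Ten noisy training points and one hundred noisy test points.
x_train = np.linspace(0, 1, 10)
y_train = true_function(x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = rng.uniform(0, 1, 100)
y_test = true_function(x_test) + rng.normal(scale=0.2, size=x_test.size)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```

The straight line (degree 1) is too simple and does poorly on both sets, which is underfitting. The degree 9 polynomial passes through all ten training points, so its training error is practically zero, but its error on the unseen points is much larger than that: it has memorised the noise.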
Benford's law describes the frequency of the initial digits in datasets of numbers where the numbers span several orders of magnitude. It should be visible if you plot country populations or land areas, and it can also show up in accounting. Within the EU, Greece's official economic data has in the past shown the greatest divergence from what Benford's law predicts, see here.
I tried this myself, creating a dataset with country populations from 2017. I then used the following code to plot the data:
plot value counts of initial digits for country populations
The number 1 accounts for around 30% of the initial digits, as expected, but some digits are out of sequence: 7 comes last where it should be 9, and 3 and 4 are also out of order. The overall shape is roughly as expected.
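For comparison, Benford's law predicts that the leading digit d appears with probability log10(1 + 1/d), which is where the roughly 30% for the digit 1 comes from. The sketch below counts leading digits with pandas `value_counts` and puts them next to the predicted values. Since the population file isn't included here, it uses the first 1,000 powers of 2 as stand-in data (my choice, not the dataset above), a sequence known to follow Benford's law closely.

```python
import math

import pandas as pd

# Stand-in data: the first 1,000 powers of 2 (NOT the country populations).
values = [2 ** n for n in range(1000)]

# Leading digit of each value.
leading = pd.Series([int(str(v)[0]) for v in values])

# Observed share of each leading digit, ordered 1..9.
observed = leading.value_counts(normalize=True).sort_index()

# Share predicted by Benford's law: log10(1 + 1/d).
expected = pd.Series({d: math.log10(1 + 1 / d) for d in range(1, 10)})

comparison = pd.DataFrame({"observed": observed, "expected": expected})
print(comparison.round(3))
```

To try it on real data, replace `values` with the population column; `observed.plot(kind="bar")` gives a bar chart of the digit frequencies.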
You can get population data from places like the U.N.
According to the font of all truthful and accurate knowledge - Wikipedia - "Bayesian probability is an interpretation of the concept of probability, in which, instead of frequency or propensity of some phenomenon, probability is interpreted as reasonable expectation representing a state of knowledge or as quantification of a personal belief."
So what does that mean? The YouTube video below gives a good visual explanation.
Here is another explanation:
According to Wikipedia dizygotic (fraternal) twins usually occur when two fertilized eggs are implanted in the uterus wall at the same time while monozygotic (identical) twins occur when a single egg is fertilized to form one zygote (hence, "monozygotic") which then divides into two separate embryos.
Fraternal twins can be mm, mf, fm or ff (where m = male and f = female), identical twins can only be mm, or ff.
For the sake of this example let's say the probability of each option is equal, so P(mm) = P(mf) = P(fm) = P(ff) = 0.25 for fraternal twins and P(mm) = P(ff) = 0.5 for identical twins. The probability that twins are identical is P(I) = 0.1, so the probability that they are fraternal is P(F) = 0.9, assuming twins must be either identical or fraternal (not strictly true but let's not make things too complicated).
If we have two brothers who are twins what is the probability that they are identical twins?
The non-Bayesian answer might be 0.1 or 10%, because I said above that 10% of twins are identical. However, the Bayesian approach gives a different answer:
The probability of identical twins given that both twins are brothers is written P(I|B), and Bayes' theorem gives: P(I|B) = P(B|I)P(I) / P(B)
Since we are assuming twins must be either identical or fraternal: P(B) = P(B|I)P(I) + P(B|F)P(F)
Substituting this into the above gives: P(I|B) = P(B|I)P(I) / (P(B|I)P(I) + P(B|F)P(F))
then putting in the numbers gives (0.5 x 0.1)/((0.5 x 0.1) + (0.25 x 0.9)) = 2/11 (about 18.2%) - so the knowledge that both twins are male makes the probability they are identical higher.
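If you want to check the arithmetic, the calculation is easy to reproduce in Python. This sketch uses the standard library's `fractions.Fraction` so the answer comes out as an exact fraction; the variable names are my own.

```python
from fractions import Fraction

# Prior probabilities from the example.
p_identical = Fraction(1, 10)
p_fraternal = 1 - p_identical

# Probability that both twins are boys, given each kind of twin.
p_brothers_given_identical = Fraction(1, 2)   # mm out of {mm, ff}
p_brothers_given_fraternal = Fraction(1, 4)   # mm out of {mm, mf, fm, ff}

# Total probability of two brothers: P(B) = P(B|I)P(I) + P(B|F)P(F).
p_brothers = (p_brothers_given_identical * p_identical
              + p_brothers_given_fraternal * p_fraternal)

# Bayes' theorem: P(I|B) = P(B|I)P(I) / P(B).
posterior = p_brothers_given_identical * p_identical / p_brothers

print(posterior, float(posterior))
```

Changing `p_identical` shows how strongly the answer depends on the prior.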
The video below makes a good point about the advantages and at least one disadvantage of a Bayesian approach.
The term regression originated with a 19th century English guy. His name was Galton and he loved to measure stuff, for example he measured the height of people who had tall parents and found that their average height was less than the parents' average height. He called this regression to the mean. The name 'regression' stuck.
In machine learning you'll come across two common algorithms that include the term regression in their title.
Linear Regression Example
Linear regression is all about predicting numerical values, for example the number of customers in a restaurant on a given day, the price of some commodity or, in the example below, the maximum temperature for a given minimum temperature. Using a dataset of weather observations recorded during the Second World War we can build a predictive model with linear regression. The dataset contains min and max temperatures for each day, which we can plot:
There is some scatter but the plot is quite linear. So this seems to be a good case for using linear regression. Linear regression has the following relationship between the input x and the output y:
y = mx + b, m is the gradient of the line and b is the intercept
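As a minimal sketch of fitting y = mx + b, the snippet below uses NumPy's `polyfit` with degree 1, which performs an ordinary least-squares straight-line fit. The six temperature pairs are made up for illustration and are not taken from the wartime weather dataset.

```python
import numpy as np

# Hypothetical (min, max) daily temperatures in Celsius; NOT the real dataset.
min_temp = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 22.0])
max_temp = np.array([21.0, 22.0, 26.0, 28.0, 31.0, 32.0])

# Least-squares fit of a straight line: returns gradient m and intercept b.
m, b = np.polyfit(min_temp, max_temp, 1)


def predict_max(min_temperature):
    """Predict the maximum temperature from the minimum using y = mx + b."""
    return m * min_temperature + b


print(f"gradient m = {m:.3f}, intercept b = {b:.3f}")
print(f"predicted max for a min of 16: {predict_max(16.0):.1f}")
```

Libraries such as scikit-learn offer the same kind of fit with more tooling around it, but for a single input feature a degree 1 `polyfit` is enough to see what m and b mean.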
Linear regression is all about predicting a numerical value. Logistic regression, however, is about predicting which class something belongs to. In the example below I use a list of Titanic passengers to classify which passengers survived and which died. The code uses two thirds of the rows as training data, then attempts to predict the Survived column value for the remaining third.
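To show the idea without the Titanic file, here is a from-scratch sketch of logistic regression in NumPy, trained with plain gradient descent. It is not the code used for the Titanic example: the two features and the 0/1 labels are synthetic, and the learning rate and iteration count are arbitrary choices of mine. Only the two thirds / one third split follows the description above.

```python
import numpy as np


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, learning_rate=0.1, n_iterations=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = np.zeros(X.shape[1])
    bias = 0.0
    for _ in range(n_iterations):
        error = sigmoid(X @ weights + bias) - y
        weights -= learning_rate * (X.T @ error) / len(y)
        bias -= learning_rate * error.mean()
    return weights, bias


def predict(X, weights, bias):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return (sigmoid(X @ weights + bias) >= 0.5).astype(int)


# Synthetic stand-in for passenger features and a 0/1 outcome column.
rng = np.random.default_rng(0)
n_rows = 300
X = rng.normal(size=(n_rows, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_rows) > 0).astype(int)

# Two thirds of the rows for training, the remaining third for testing.
shuffled = rng.permutation(n_rows)
split = (2 * n_rows) // 3
train_rows, test_rows = shuffled[:split], shuffled[split:]

weights, bias = fit_logistic(X[train_rows], y[train_rows])
accuracy = (predict(X[test_rows], weights, bias) == y[test_rows]).mean()
print(f"accuracy on the held-out third: {accuracy:.2f}")
```

In practice you would more likely reach for a ready-made implementation such as scikit-learn's `LogisticRegression`, but writing it out shows that the model is just a weighted sum pushed through a sigmoid.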
Linear regression is used to predict numerical values, and it can be extended to non-linear regression, for example see here. Logistic regression is used in classification problems; real world examples could include classifying customers into categories, classifying network activity as benign or suspicious ...