This tutorial requires Python, pandas and matplotlib. All of these, and many other data science packages, are available by installing Anaconda, which is free and safe. There are many online tutorials and videos on installing Anaconda; one useful video available on YouTube is here.

Tutorial:

Given a data set, it is often useful to get some summary statistics that describe the data. The following is a beginner-level tutorial on using Python and pandas to get some initial statistics and data visualisations for a data set. The data set is small, just ten rows. The data comes from Forbes and YouTube and compares money earned against number of subscribers for ten of YouTube's top-earning creators.

    earned (in millions of $US)    subscribers (approximate, in millions)
    15                             50
    8                              10
    7.5                            10
    7                              22
    6                              8
    6                              7
    5.5                            15
    5.5                            30
    5                              7
    5                              11.5

One question that the data can answer: is there a linear relationship between money earned and number of subscribers?

Getting summary statistics requires just a few lines of code:

    import pandas as pd

    # youtube.csv holds the table above, with columns named 'earned' and 'subs'
    df = pd.read_csv('youtube.csv')
    print(df.describe())

This produces the following output:

              earned       subs
    count  10.000000  10.000000
    mean    7.050000  17.050000
    std     2.976295  13.728417
    min     5.000000   7.000000
    25%     5.500000   8.500000
    50%     6.000000  10.750000
    75%     7.375000  20.250000
    max    15.000000  50.000000

So we can see that the mean earnings for the top ten YouTubers are about $7,000,000, but the standard deviation is about $3,000,000, which is quite large, so maybe we shouldn't read too much into the mean. The describe() function also gives us the quartile values. For earnings there is very little difference between the 25% and 50% values, so the bottom half of the rankings all earn approximately the same, but the top 50% show more variation, with the top earner getting more than twice the average earnings.

We can plot this data to see if there is a simple relationship between the variables. The code used to get this plot is:
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv('youtube.csv')
    # Subscribers on the x-axis, earnings on the y-axis
    plt.scatter(df.subs, df.earned)
    plt.xlabel('subscribers (millions)')
    plt.ylabel('earned (millions of $US)')
    plt.show()

It is clear that no simple linear relationship exists; in other words, increasing your subscribers by a factor of two does not necessarily mean you'll earn twice as much. There must be other factors involved; however, the data set we have to work with does not provide any clues about what these additional factors might be.
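
One way to put a rough number on the conclusion above is the Pearson correlation coefficient. This check is not part of the original walkthrough; it is a minimal sketch. It assumes the table values are typed in directly instead of being read from youtube.csv, and the variable names r_all, rest and r_without_top are made up for illustration.

```python
import pandas as pd

# Values copied from the table in this post (both columns in millions).
# Assumption: typed in by hand here so the example runs without youtube.csv.
df = pd.DataFrame({
    "earned": [15, 8, 7.5, 7, 6, 6, 5.5, 5.5, 5, 5],
    "subs": [50, 10, 10, 22, 8, 7, 15, 30, 7, 11.5],
})

# Pearson correlation between subscribers and earnings for all ten rows.
r_all = df["subs"].corr(df["earned"])

# Same calculation with the single top earner removed, to see how much
# that one point drives the result.
rest = df[df["earned"] < df["earned"].max()]
r_without_top = rest["subs"].corr(rest["earned"])

print(f"all ten creators:   r = {r_all:.2f}")
print(f"without top earner: r = {r_without_top:.2f}")
```

With these ten rows the coefficient for the full set is fairly high, but only because of the top earner. Once that row is dropped it falls to around zero, which fits the reading that subscriber count alone does not explain earnings.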
