Below is a step-by-step example of how to use R to calculate a correlation.
Step 1: find the data to test - I used some data from the LondonDataStore. The data was in two files, an xls file and a csv file. I extracted the data I needed from each file, combined it into one csv file and saved this file in R's working directory. You can use the getwd() function to find your working directory and the setwd() function if you want to change it.
Step 2: check that the necessary file is available in the working directory - in R enter list.files(getwd()). You should see your data file listed; if not, you need to move it into your working directory.
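As a quick illustration of the two functions mentioned in step 1, here is a minimal sketch. The use of tempdir() is only a stand-in for whichever folder holds your data; it is not the folder I actually used.

```r
# Show the directory R currently reads from and writes to
old_dir <- getwd()
print(old_dir)

# Switch to another directory; tempdir() is only a placeholder for your data folder
setwd(tempdir())
print(getwd())

# Switch back so the rest of the session is unaffected
setwd(old_dir)
```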
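Once the file is in place it has to be read into a data frame before cor() can use it. The sketch below assumes a hypothetical file name and a few made-up rows (not the real London figures), written to a temporary folder so the example runs on its own. It also shows how to check the column types, which is worth doing before calculating anything.

```r
# Hypothetical file name and made-up values, used only to make the sketch runnable
csv_path <- file.path(tempdir(), "smokers_unemployment.csv")
writeLines(c(
  "borough,number.of.smokers,unemployed.percent",
  "Borough A,31000,7.2",
  "Borough B,n/a,9.1",
  "Borough C,45000,6.4"
), csv_path)

# Confirm the file is where we expect it to be
print(file.exists(csv_path))

# Read the csv into a data frame
testData <- read.csv(csv_path)

# Inspect the type of each column; a stray text entry such as "n/a"
# makes the whole column non-numeric
print(sapply(testData, class))
```

In this made-up file the "n/a" entry stops R from treating number.of.smokers as numeric, which is one possible way to end up with a non-numeric column.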
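Step 3: run the cor() function on the two columns of the data frame (testData here) to get a value for the correlation:
> cor(testData$number.of.smokers, testData$unemployed.percent)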
When I tried this I got an error:
Error in cor(testData$number.of.smokers, testData$unemployed.percent) :
  'x' must be numeric
So I used the as.numeric() function on the x input:
> cor(as.numeric(testData$number.of.smokers), testData$unemployed.percent)
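One thing to watch with as.numeric(): if the column was read in as a factor (the default for text columns in R versions before 4.0), as.numeric() returns the internal level codes rather than the original values, and the correlation is then calculated on the wrong numbers. The sketch below uses made-up values to show the difference. Going via as.character() first is the usual way around it.

```r
# Made-up values stored as a factor, as older R versions did for text columns
smokers <- factor(c("9000", "31000", "120000", "45000"))

# as.numeric() on a factor gives the level codes, not the values
codes <- as.numeric(smokers)
print(codes)

# Converting to character first recovers the real numbers
values <- as.numeric(as.character(smokers))
print(values)
```

If the column is a character vector instead, as.numeric() does convert the values, but any entry that is not a number becomes NA with a warning.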
This gives me a correlation value of just over 0.2, suggesting that there is not a strong correlation between the unemployment percentage and the estimated number of smokers for boroughs in London. In other words, there doesn't seem to be much of a linear relationship between a borough's unemployment rate and how many smokers it has.
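To make it clearer what the number means: by default cor() calculates Pearson's correlation coefficient, which is the covariance of the two variables divided by the product of their standard deviations, so it always lies between -1 and 1. The sketch below checks this by hand on made-up figures for five hypothetical boroughs (not the real data).

```r
# Made-up figures for five hypothetical boroughs
unemployed <- c(5.1, 7.3, 6.0, 9.2, 8.4)
smokers <- c(30000, 41000, 28000, 39000, 45000)

# Pearson's r: covariance divided by the product of the standard deviations
r_manual <- cov(unemployed, smokers) / (sd(unemployed) * sd(smokers))

# The built-in function uses the Pearson method by default
r_builtin <- cor(unemployed, smokers)

print(c(manual = r_manual, builtin = r_builtin))
```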
If I want to see the data plotted I can use the plot() function:
> plot(testData$unemployed.percent, testData$number.of.smokers)
[Figure: scatter plot of number.of.smokers against unemployed.percent]
If there had been a strong correlation, the points would have clustered around a straight line rather than being spread as randomly as in the plot above.
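To get a feel for the difference, here is a small sketch that draws two scatter plots side by side from simulated data (not the London figures): one where y depends on x plus some noise, and one where y is unrelated to x. All the variable names and numbers are made up for illustration.

```r
# Simulated data only; the seed makes the example repeatable
set.seed(1)
x <- runif(30, 4, 12)

# y depends on x, with some random noise added
y_related <- 5000 * x + rnorm(30, sd = 2000)

# y is drawn independently of x
y_unrelated <- runif(30, 20000, 60000)

# Draw the two scatter plots side by side, then restore the plot settings
op <- par(mfrow = c(1, 2))
plot(x, y_related, main = "Related")
plot(x, y_unrelated, main = "Unrelated")
par(op)

# Compare the correlation values for the two cases
print(cor(x, y_related))
print(cor(x, y_unrelated))
```

In the first plot the points follow a line and cor() returns a value close to 1. In the second they are scattered with no clear pattern.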