Below is a step-by-step example of how to use R to calculate a correlation.

Step 1: find the data to test

I used some data from the LondonDataStore. The data was in two files, an xls file and a csv file. I extracted the data I needed from each file, combined it into one csv file and saved this file in the working directory for R. You can use the getwd() function to find your working directory and the setwd() function if you want to change it.

Step 2: check that the necessary file is available in the working directory

In R, enter list.files(getwd()). You should see your data file listed; if not, you need to get it into your working directory. Then read the file into a data frame (called testData below), for example with read.csv().

Step 3: run the cor() function to get a value for the correlation

    cor(testData$number.of.smokers, testData$unemployed.percent)

When I tried this I got an error:

    Error in cor(testData$number.of.smokers, testData$unemployed.percent) : 'x' must be numeric

So I used the as.numeric() function on the x input:

    > cor(as.numeric(testData$number.of.smokers), testData$unemployed.percent)
    [1] 0.2015544

One caution here: if the column was read in as a factor, as.numeric() returns the internal level codes rather than the original values, so as.numeric(as.character(...)) is the safer conversion. It is worth checking the column types with str(testData) before trusting the result.

This gives me a correlation value of just over 0.2, suggesting that there is not a strong correlation between the unemployment percentage and the estimated number of smokers for boroughs in London. In other words, there doesn't seem to be much of a linear relationship between how many people in a borough are unemployed and how many smoke.

If I want to see the data plotted I can use the plot() function:

    plot(testData$unemployed.percent, testData$number.of.smokers)

This gives a scatter plot of the two variables. If there had been a strong correlation, the points would have clustered around a line rather than being spread as randomly as they are in this plot.
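The 'x' must be numeric error usually means the column was read in as text or as a factor, often because the numbers contain thousands separators. Here is a minimal sketch of one way to handle that, assuming the smokers column looks like "12,000" (that is a guess, since the original file isn't shown). The csv_text values are made up purely for illustration and are not the LondonDataStore figures; only the column names come from the example above.

```r
# Hypothetical stand-in for the combined csv file. These values are invented
# for illustration only; they are NOT the real LondonDataStore data.
csv_text <- 'borough,number.of.smokers,unemployed.percent
A,"12,000",5.1
B,"30,500",7.4
C,"8,200",4.0
D,"21,900",6.3
E,"15,300",8.8'

# Read text columns as character rather than factor, so the conversion to
# numbers is explicit. (Before R 4.0, read.csv defaulted to factors.)
testData <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Strip the thousands separators, then convert to numeric.
testData$number.of.smokers <- as.numeric(gsub(",", "", testData$number.of.smokers))

# Check the column types before calculating anything.
str(testData)

# Pearson correlation between the two numeric columns.
r <- cor(testData$number.of.smokers, testData$unemployed.percent)
print(r)

# Why as.numeric() on a factor is risky: it returns level codes, not values.
f <- factor(c("300", "20", "1000"))
codes  <- as.numeric(f)                # 3 2 1 (positions in the sorted levels)
values <- as.numeric(as.character(f))  # 300 20 1000
print(codes)
print(values)
```

If str(testData) shows a column as Factor or chr when you expected num, that is the column to clean up before calling cor().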
