This tutorial may be of use to people new to R, to ggplot, or to both. Part two gives an example of using ggplot to visualise a linear regression.

The vector is a basic data type in R. It can hold zero or more values; in non-computing terms, think of it as a named list of values. For example, I can create a vector called v and give it the values 1, 2, 3 in the console window in RStudio, or on the R command line:

    v <- c(1,2,3)

I can create as many of these vectors as I want:

    v <- c(1,2,3)
    w <- c(4,5,6)
    x <- c(7,8,9)

The next data type is the data frame. You can think of it as one step up from the basic vector, in the sense that a data frame can contain multiple vectors. Like vectors, data frames are named. In the example below I create a data frame which I call A and add the three vectors v, w, x to it:

    A <- data.frame(v,w,x)

You can see what is in the data frame just by typing A and then return:

    A
      v w x
    1 1 4 7
    2 2 5 8
    3 3 6 9

You could think of a data frame as an Excel spreadsheet or a matrix. Unlike a matrix, however, a data frame can contain a mix of data types. For example, I can add more columns to the data frame. I want to add a new column called y with the values a, b, c. There are several ways to do this; one method is:

    A$y <- c('a','b','c')

The column y has been added to the data frame:

    A
      v w x y
    1 1 4 7 a
    2 2 5 8 b
    3 3 6 9 c

Note that I had to put quotes around the values a, b and c. If I didn't, I'd get an error:

    A$y <- c(a,b,c)
    Error: object 'a' not found

R thinks that a, b and c are objects like our vectors v, w, x or the data frame A. The quotes tell R that we mean the characters a, b and c. You can use single or double quotes, whichever you prefer.
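To see the difference between a data frame and a matrix in practice, here is a small sketch using the same vectors as above. It rebuilds A so that the block runs on its own, then compares the column types in the data frame with what happens when the same data is forced into a matrix. The names col_types and M are just illustrative.

```r
# Rebuild the example data frame so this block is self-contained
v <- c(1, 2, 3)
w <- c(4, 5, 6)
x <- c(7, 8, 9)
A <- data.frame(v, w, x)
A$y <- c('a', 'b', 'c')

# In a data frame each column keeps its own type
col_types <- sapply(A, class)
print(col_types)

# A matrix can hold only one type, so every value is coerced to character
M <- as.matrix(A)
print(class(M[1, 1]))
```

The numeric columns stay numeric inside the data frame, but once the character column y is present the matrix version stores everything as text.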
You can access parts of the data frame as follows. To see all the values in a column:

    A$w
    [1] 4 5 6

Or to see one specific value, for example the second value in the w column:

    A$w[2]
    [1] 5

You can assign this value to a variable:

    val <- A$w[2]

You can then perform different actions on the variable:

    val + 3
    [1] 8
    val^2
    [1] 25

and so on. Usually, of course, we don't use hard-coded data; we use data that we want to investigate or model. The function read.csv() is useful here: it reads a CSV file into a data frame.

Part two

You can use the correlation function in R, cor(), to check whether two variables are related, which is a first step towards testing for a linear relationship. I call it with three arguments:

    cor(data, use="complete.obs", method="kendall")

The first argument is the name of the data frame. The second specifies how the function should handle missing data: "complete.obs" drops any row that contains a missing value and uses only the complete rows; see the R documentation online for the other options. The third is the correlation method. There are three to choose from: Pearson, Kendall and Spearman. Pearson measures linear association, while Kendall and Spearman are rank-based and measure whether one variable tends to rise as the other does.

As an example I will use data for this blog over the last 4 weeks. The data has two variables: the number of unique visitors per day and the total number of page views per day. To test for a correlation I used:

    cor(d, use="complete.obs", method="kendall")
               visitors pageViews
    visitors  1.0000000 0.6963154
    pageViews 0.6963154 1.0000000

This implies some degree of correlation (0.7 when rounded to one decimal place). Using method='spearman' gives a higher correlation:

    cor(d, use="complete.obs", method="spearman")
               visitors pageViews
    visitors  1.0000000 0.8453994
    pageViews 0.8453994 1.0000000

So let's assume that the two variables have a linear relationship. I can use R to test this idea. A linear relationship has the form y = mx + c, where y and x are the variables; in this case y = page views and x = visitors. m is the gradient of the line and c is the value at which the line cuts the y axis. I can use the lm() function to get these values for my data.
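For a straight line y = mx + c fitted by least squares, m and c have simple closed forms: m is the covariance of x and y divided by the variance of x, and c is mean(y) minus m times mean(x). The sketch below computes them by hand and compares them with a fitted model. The data frame d_demo and its numbers are made up for illustration and are not the blog's visitor data; the intercept is stored as c0 to avoid clashing with R's c() function.

```r
# Hypothetical data for illustration only (not the blog's real figures)
d_demo <- data.frame(
  visitors  = c(5, 10, 15, 20, 25),
  pageViews = c(20, 27, 31, 40, 44)
)

# Least-squares gradient and intercept computed from their definitions
m  <- cov(d_demo$visitors, d_demo$pageViews) / var(d_demo$visitors)
c0 <- mean(d_demo$pageViews) - m * mean(d_demo$visitors)

# The same quantities as estimated by a fitted linear model
fit <- lm(pageViews ~ visitors, data = d_demo)
print(c(intercept = c0, gradient = m))
print(coef(fit))
```

The two printed results should agree, since these formulas are what a simple one-variable fit with lm() estimates.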
The lm() function fits linear models; see the documentation for details.

    a <- lm(pageViews ~ visitors, data = d)
    a

    Call:
    lm(formula = pageViews ~ visitors, data = d)

    Coefficients:
    (Intercept)     visitors
         14.111        1.215

So my model is:

    pageViews = 1.2*visitors + 14.1

I can use ggplot2 to generate a scatterplot:

    library(ggplot2)
    ggplot() + geom_point(data = d, aes(x = visitors, y = pageViews))

I can then add the line to this plot and set the scales on the axes:

    sp <- ggplot() + geom_point(data = d, aes(x = visitors, y = pageViews))
    abl <- geom_abline(intercept = 14.111, slope = 1.215)
    x <- scale_x_continuous(limits = c(0, 40))
    y <- scale_y_continuous(limits = c(0, 75))
    sp + abl + x + y

Notice that with ggplot I can build up the graph by adding extra layers. I could also add a title, change colours, change the plot shape and so on.
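Typing the intercept and gradient into geom_abline() by hand works, but the values can also be read from the fitted model with coef(), so the line stays in step with the data. A minimal sketch follows, assuming ggplot2 is installed; d_demo is a made-up stand-in for the blog data, and the title text is only an example.

```r
library(ggplot2)

# Hypothetical stand-in for the visitor data
d_demo <- data.frame(
  visitors  = c(5, 10, 15, 20, 25),
  pageViews = c(20, 27, 31, 40, 44)
)

fit <- lm(pageViews ~ visitors, data = d_demo)
cf  <- coef(fit)  # named vector with entries "(Intercept)" and "visitors"

# Build the plot layer by layer, taking the line from the fitted model
p <- ggplot(d_demo, aes(x = visitors, y = pageViews)) +
  geom_point() +
  geom_abline(intercept = cf[["(Intercept)"]], slope = cf[["visitors"]]) +
  scale_x_continuous(limits = c(0, 40)) +
  scale_y_continuous(limits = c(0, 75)) +
  ggtitle("Page views against visitors")

print(p)
```

Another option is geom_smooth(method = "lm"), which fits and draws the line in one step and adds a confidence band by default.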
How good is the model? I have my doubts about the accuracy of the model at values below 10 visitors, because the linear model predicts that when the number of visitors is zero there will be 14 page views, but of course no visitors would mean no page views. Having a model that only works within a limited range is not a disaster, but I would need to make people aware that the model is only an approximation and may have limits beyond which it does not work.
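One way to explore this concern, offered as a sketch rather than a recommendation, is to compare the ordinary fit with a line forced through the origin. In an R formula, adding 0 + to the right-hand side removes the intercept. The data below is made up for illustration, and the names fit_line, fit_origin and zero are my own.

```r
# Hypothetical data for illustration only
d_demo <- data.frame(
  visitors  = c(5, 10, 15, 20, 25),
  pageViews = c(20, 27, 31, 40, 44)
)

fit_line   <- lm(pageViews ~ visitors, data = d_demo)      # y = m*x + c
fit_origin <- lm(pageViews ~ 0 + visitors, data = d_demo)  # y = m*x, no intercept

# Predicted page views at zero visitors under each model
zero <- data.frame(visitors = 0)
print(predict(fit_line, zero))
print(predict(fit_origin, zero))

# Residual sum of squares: lower means the line sits closer to the points
print(c(line   = sum(resid(fit_line)^2),
        origin = sum(resid(fit_origin)^2)))
```

The line through the origin gives zero page views for zero visitors by construction, but it can never fit the observed points more closely than the line with an intercept, so the choice depends on which range of visitor numbers matters.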
