The previous post detailed how to write Python functions to get some basic descriptive statistics. This post shows how to obtain the same results using R.
Step 1: Save a copy of the .csv file to the working directory. You can find your working directory with getwd().

Step 2: In RStudio, create a new R script file and save it to the working directory.

Step 3: Create a function which reads the data from the csv file. This is not strictly necessary; if you prefer, you can call read.csv() directly at the console without wrapping it in a function:

    classSize <- function() {
      fullData <- read.csv("class_size.csv")
    }

Run this function definition; hopefully there are no syntax errors. Then, in the console, call the function and store the result (read.csv() returns a data frame):

    d <- classSize()

You can check some of the data using the head() function:

    head(d, 10)

Step 4: I'm interested in the class size data, which is the seventh column, called 'Total.Pupils', so I want to isolate this column into a new vector. If you look on Stack Overflow there are lots of suggestions for how this can be done, many of which seem overly complicated. The easiest way to get one column is:

    csize <- d[[7]]

Note that the column is taken from d, because fullData only exists inside the function. You can check this worked using the head() function:

    > head(csize, 20)
    [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
    .....

This looks good, so:

Step 5: I can now focus on generating the required statistics from the csize vector. The function range() returns the lowest and highest values in the vector, mean() returns the mean and sd() returns the standard deviation:

    > range(csize)
    [1]  1 48
    > round(mean(csize), 2)
    [1] 23.55
    > round(sd(csize), 2)
    [1] 5.38
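One detail worth noting when comparing these results with the Python post: R's sd() divides by n - 1 (the sample standard deviation), whereas the hand-written Python code divides by N (the population standard deviation). For a data set with many rows the two agree to two decimal places, which is why both posts report 5.38. The sketch below illustrates the difference using Python's standard statistics module; the class sizes in it are made up for illustration and are not the Scottish Government data.

```python
import statistics

# Made-up class sizes, for illustration only (not the real data set).
sizes = [1, 18, 22, 24, 25, 27, 30, 48]

n = len(sizes)
sample_sd = statistics.stdev(sizes)       # divides by n - 1, like R's sd()
population_sd = statistics.pstdev(sizes)  # divides by n, like the hand-written Python code

print("sample sd     =", round(sample_sd, 2))
print("population sd =", round(population_sd, 2))

# The two differ only by a factor of sqrt(n / (n - 1)),
# which gets closer to 1 as the number of rows grows.
ratio = sample_sd / population_sd
print("ratio         =", round(ratio, 4))
```

With only eight values the gap is visible; with a larger data set the ratio gets much closer to 1.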
I used a data set on class sizes for primary schools in Scotland. The data is published by the Scottish Government and is available here. The code used:
    import csv

    def read_file(filename):
        # Read the first column of the csv file as integers, skipping the header row
        numbers = []
        with open(filename) as f:
            reader = csv.reader(f)
            next(reader)
            for row in reader:
                numbers.append(int(row[0]))
        return numbers

    def calculate_mean(numbers):
        s = sum(numbers)
        N = len(numbers)
        mean = s/N
        return mean

    def find_range(numbers):
        lowest = min(numbers)
        highest = max(numbers)
        r = highest - lowest
        return r, lowest, highest

    def find_differences(numbers):
        # Difference between each value and the mean
        mean = calculate_mean(numbers)
        diff = []
        for num in numbers:
            diff.append(num - mean)
        return diff

    def calculate_variance(numbers):
        # Population variance: mean of the squared differences
        diff = find_differences(numbers)
        squared_diff = []
        for d in diff:
            squared_diff.append(d**2)
        sum_squared_diff = sum(squared_diff)
        variance = sum_squared_diff/len(numbers)
        return variance

    numbers = read_file('classSize.csv')
    m = calculate_mean(numbers)
    rng, l, h = find_range(numbers)
    variance = calculate_variance(numbers)
    std = variance**0.5
    print('mean = ' + str(round(m, 2)))
    print('range = ' + str(rng))
    print('smallest class size = ' + str(l))
    print('largest class size = ' + str(h))
    print('standard deviation = ' + str(round(std, 2)))

The code is based on examples from 'Doing Math with Python' by Amit Saha, No Starch Press. Although the book is not specifically aimed at people interested in data science or data analytics, it does include chapters on probability, statistics and calculus.

The code generated the following statistics:

    mean = 23.55 (to two decimal places)
    range = 47
    smallest class size = 1
    largest class size = 48
    standard deviation = 5.38 (to two decimal places)

I used IDLE as a development environment and saved the .csv file and code in the same directory, which means I don't need a full path for the file name in the code.
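As a cross-check on the hand-written functions, the same statistics can also be obtained from Python's standard statistics module. This is a minimal sketch rather than the code from the book: the names read_numbers, describe and SAMPLE_CSV are hypothetical, and the in-memory CSV text stands in for the real file so the example runs on its own. It uses statistics.pstdev(), which divides by N and so matches the variance calculation above.

```python
import csv
import io
import statistics

# Hypothetical stand-in for the csv file: a header row, then one class size per row.
SAMPLE_CSV = "Total.Pupils\n18\n25\n30\n22\n25\n"

def read_numbers(f):
    """Read the first column of a csv file object as integers, skipping the header."""
    reader = csv.reader(f)
    next(reader)  # skip the header row
    return [int(row[0]) for row in reader if row]

def describe(numbers):
    """Return the mean, range, smallest, largest and population standard deviation."""
    lowest, highest = min(numbers), max(numbers)
    return {
        "mean": statistics.mean(numbers),
        "range": highest - lowest,
        "smallest": lowest,
        "largest": highest,
        "std": statistics.pstdev(numbers),  # divides by N, as in calculate_variance
    }

numbers = read_numbers(io.StringIO(SAMPLE_CSV))
stats = describe(numbers)
for name, value in stats.items():
    print(name, "=", round(value, 2))
```

To run this against a real file, pass an open file object to read_numbers() instead of the io.StringIO stand-in.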
