The following is not a full answer that you can just copy and paste, but it is a strong hint.
They give you some code that you can download and use as a starting point. The first few lines of my solution are:
import sqlite3

conn = sqlite3.connect('week2.sqlite') # any name is ok for the file as long as it has the sqlite extension
cur = conn.cursor()
cur.execute('DROP TABLE IF EXISTS Counts') # lets you re-run the script without a "table already exists" error
cur.execute('CREATE TABLE Counts (org TEXT, count INTEGER)') # just use the SQL they give you
fh = open('mbox.txt') # I simplified this part of the code, I just hard coded the file name
# The next part of the code does require more changes.
# The example code counts emails, but you need to count organisations, so be careful. For example,
# numerous people can send emails from the same organisation. The example code will count each of
# these email addresses as unique, but you only want to count the organisation as unique, so you
# need to further separate out the organisation from the email.
# The first part of my code is the same as the example code:
for line in fh:
    if not line.startswith('From: '): continue
    pieces = line.split()
    email = pieces[1] # the address is the second word on a 'From: ' line
    # The next part of my code isolated the organisation from email. One way to do this is just the
    # same as how email is isolated. I am not including my code for that task.
    # The last part is pretty much the same as the example code, but simplified:
    cur.execute('SELECT count FROM Counts WHERE org = ?', (org,))
    row = cur.fetchone()
    if row is None:
        cur.execute('''INSERT INTO Counts (org, count)
                VALUES (?, 1)''', (org,))
    else:
        cur.execute('UPDATE Counts SET count = count + 1 WHERE org = ?', (org,))
conn.commit() # save the counts to the database file
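As a hint for the organisation step, here is a small sketch on a made-up address rather than the course data. It assumes the organisation is simply everything after the '@'. The helper name org_from_email and the sample address are my own placeholders, not part of the example code.

```python
def org_from_email(email):
    # Hypothetical helper: return the part of an address after the '@'.
    # 'someone@example.org'.split('@') gives ['someone', 'example.org'].
    parts = email.split('@')
    return parts[1]

print(org_from_email('someone@example.org'))  # prints example.org
```

This is the same split-and-index idea used to pull the email out of the line, just with a different separator.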
You should find that the organisation with the highest count (536) is Indiana University (iupui.edu), and the University of Michigan (umich.edu) is second with a count of 491.
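To check the counts yourself, you can sort the table by count and look at the top rows. The sketch below shows only the shape of that query. It uses an in-memory database and made-up rows (a.example, b.example, c.example are placeholders), so the numbers are not the real results.

```python
import sqlite3

# In-memory database with made-up rows, only to show the query shape.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Counts (org TEXT, count INTEGER)')
cur.executemany('INSERT INTO Counts (org, count) VALUES (?, ?)',
                [('a.example', 3), ('b.example', 7), ('c.example', 5)])
conn.commit()

# Highest counts first; LIMIT keeps the output short.
top = cur.execute(
    'SELECT org, count FROM Counts ORDER BY count DESC LIMIT 2').fetchall()
for org, count in top:
    print(org, count)
conn.close()
```

Against your own week2.sqlite file, the same SELECT should list the two organisations mentioned above first.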