The following was inspired by the book Python for Informatics by Charles Severance. The mbox file is available here. Information on mbox files is available online. I used Notepad++ to write the code on a Windows 8.1 laptop; interaction with the code is via the command line. The following YouTube video explains how to set up a development environment.
Before writing any code, it is always good to have a clear idea of what the code needs to do. In this case I want to do the following: count the number of emails included in the file, create a list of the email addresses of all the unique senders, and write the information to an output text file.
1. import re
2. fhand = open('mbox.txt')
3. names = []
4. fout = open('output.txt', 'w')
5. count = 0
6. for line in fhand:
    #count the number of emails sent
7.     words = line.split()
8.     if len(words) > 0 :
9.         if words[0] == 'From':
10.             count = count + 1
    #extract email addresses of all unique senders
11.     line = line.rstrip()
12.     x = re.findall(r'From [a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
13.     if len(x) > 0:
14.         if x not in names :
15.             names.append(x)
16.             sender = str(x) + '\n'
17.             fout.write(sender)
18. fout.write('There are: ' + str(len(names)) + ' unique senders in the file \n')
19. fout.write(str(count) + ' sent emails are included in the file')
20. fout.close()
21. print ('The output has been written to the file: output.txt')
line 1: the code uses a regular expression to extract data from the file, so it needs to import the re module
line 2: open the mbox file
line 3: create an empty list
line 4: open the output file for writing
line 5: initialise a variable which will be incremented for each email found in the mbox file
lines 6 to 10 count the number of emails in the file. This part relies on the fact that each email message in an mbox file starts with a 'From ' line, i.e. a line that begins with the word From (uppercase F followed by lowercase r, o and m), followed by one space, followed by the address of the sender.
lines 11 to 17 extract the senders' addresses and write each new address to the output file
lines 18 and 19 write some summary statistics to the output file
line 20: it is good practice to close the file after writing
line 21: the code is run from the command line, so the final print statement informs the user that execution has completed and the output file has been generated.
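The counting and extraction steps described above can also be sketched as a small function that can be tried without the mbox file. This is only an illustrative variant, not the script itself: the function name summarise, the sample lines and the example addresses are all made up for the demonstration, and the regular expression uses a capture group so that only the address (without the leading From) is stored.

```python
import re

# Same pattern as in the walkthrough, but anchored to the start of the line
# and with a capture group around the address (an assumption of this sketch).
FROM_LINE = re.compile(r'^From ([a-zA-Z0-9]\S*@\S*[a-zA-Z])')


def summarise(lines):
    """Return (number of emails, unique senders in first-seen order)."""
    count = 0
    senders = []
    for line in lines:
        # Each message starts with a line beginning 'From ' (note the space).
        if line.startswith('From '):
            count += 1
        match = FROM_LINE.match(line)
        if match and match.group(1) not in senders:
            senders.append(match.group(1))
    return count, senders


# Hypothetical sample lines standing in for the contents of an mbox file.
sample = [
    'From alice@example.com Sat Jan  5 09:14:16 2008\n',
    'From: alice@example.com\n',   # header line, not a message separator
    'Subject: hello\n',
    '\n',
    'From bob@example.org Sat Jan  5 10:00:00 2008\n',
    'From alice@example.com Sun Jan  6 08:30:00 2008\n',
]

count, senders = summarise(sample)
print(count, 'emails,', len(senders), 'unique senders')
for sender in senders:
    print(sender)
```

With these sample lines the function finds 3 emails and 2 unique senders. To run it on a real file, the lines could be read inside a with open('mbox.txt') as fhand: block, which closes the file automatically when the block ends.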