The following code will mine security data from seclists.org/rss/bugtraq.rss:
import feedparser

url = 'http://seclists.org/rss/bugtraq.rss'
d = feedparser.parse(url)  # download and parse the feed
entries = d['entries']
data = []  # one (title, description) pair per feed entry
for entry in entries:
    title = entry.title
    description = entry.description
    data.append((title, description))
You can easily add regular expressions to extract further data from the feed. However, this particular RSS feed only gives a shortened version of the original email, so you need to parse the HTML to get access to all the data.
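As a rough illustration of the regular-expression idea, here is a minimal sketch that pulls CVE identifiers out of a title and description. Looking for CVE ids is my assumption about what "further data" might mean, and extract_cves and the sample strings are hypothetical, not taken from the feed.

```python
import re

# CVE identifiers have the form CVE-<year>-<four or more digits>.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)


def extract_cves(text):
    """Return the distinct CVE ids in text, upper-cased, in order of first appearance."""
    seen = []
    for match in CVE_PATTERN.findall(text):
        cve = match.upper()
        if cve not in seen:
            seen.append(cve)
    return seen


# Hypothetical titles and descriptions standing in for real feed entries.
sample = [
    ("Example advisory (CVE-2015-0001)", "Details for cve-2015-0001 and CVE-2015-12345."),
    ("Re: mailing list housekeeping", "No identifiers here."),
]
for title, description in sample:
    print(title, extract_cves(title + " " + description))
```

The same function could be called on each title and description inside the feed loop; because the feed text is shortened, some identifiers may only show up in the full email.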
Data in XML format is perhaps not as common as JSON or CSV, but it is worth learning how to parse it. Python offers several ways to parse XML; I like BeautifulSoup. I am using a file from TheyWorkForYou.com, which makes statistics on UK politicians available online. One project they ran, called WriteToThem, measured politicians' response rates to written queries. The data can be downloaded as an XML file. The file is called mps.xml and has the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<writetothem>
<personinfo ... constituency="Colne Valley" party="Conservative"
writetothem_responsiveness_fuzzy_response_description_2015="very high" />
...
</writetothem>
The first line is the XML declaration, which just gives the version and encoding used. The top-level tag is <writetothem>; within it are several hundred 'personinfo' tags, and each of these holds the data I want. To extract the names I can use:
from bs4 import BeautifulSoup as bs
with open('mps.xml', 'r') as in_f:
    data = in_f.read()
soup = bs(data, 'lxml')
politicians = soup.find_all('personinfo')
for mp in politicians:
    print(mp['name'])
Explanation of the above code:
line 1: we are using Beautiful Soup, so we need to import it
line 2: 'with open(...)' is a useful structure: it closes the file automatically, even if an error occurs while reading. It does not handle a missing file, so if the file is not in the directory you run the script from, you need to give its path
line 3: read the whole file into a string
line 4: create the BeautifulSoup object; 'lxml' selects the parser from the third-party lxml library, which has to be installed separately. Strictly speaking this is lxml's HTML parser, which copes with this file; 'xml' selects lxml's XML parser
line 5: find all of the 'personinfo' tags
line 6: iterate through these tags
line 7: for each 'personinfo' tag print the name value
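As an aside, a similar extraction can be sketched with the standard library's xml.etree.ElementTree instead of BeautifulSoup. This is an alternative, not the approach used above. The sample document is hypothetical: it follows the structure described earlier, but the names, most of the constituencies and the ratings are invented, and the 'name' attribute is assumed from the description of the file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Attribute name taken from the file structure shown earlier.
RATING_ATTR = "writetothem_responsiveness_fuzzy_response_description_2015"

# Hypothetical sample; the real mps.xml holds several hundred personinfo tags.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<writetothem>
  <personinfo name="Example Member A" constituency="Colne Valley" party="Conservative"
    writetothem_responsiveness_fuzzy_response_description_2015="very high" />
  <personinfo name="Example Member B" constituency="Example East" party="Labour"
    writetothem_responsiveness_fuzzy_response_description_2015="low" />
  <personinfo name="Example Member C" constituency="Example West" party="Labour"
    writetothem_responsiveness_fuzzy_response_description_2015="very high" />
</writetothem>
"""


def read_people(xml_text):
    """Return one dict of attributes per personinfo tag."""
    # Parse from bytes so the encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [dict(person.attrib) for person in root.iter("personinfo")]


people = read_people(SAMPLE)
names = [person.get("name") for person in people]
ratings = Counter(person.get(RATING_ATTR) for person in people)

print(names)
print(ratings.most_common())
```

Using .get() rather than indexing means a tag without the attribute gives None instead of raising an error, which is convenient when not every politician has a rating.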
This code includes a solution to one problem that might be useful to others: how to get the first digit of each element in a column and use it to create a new column.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('pops.csv')
df['first_digit'] = df['population'].astype(str).str[0]  # first character of each value
df1 = df.first_digit.value_counts()
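To make the first-digit step easier to try without the CSV file, here is a small self-contained sketch. The population figures are invented stand-ins for pops.csv, and the regex variant is my own addition for values that might start with a sign or a zero, not part of the original script.

```python
import pandas as pd

# Hypothetical data standing in for pops.csv; the figures are invented.
df = pd.DataFrame({"population": [1520, 98, 23400, 187, 1999, 305]})

# Same approach as above: take the first character of the string form.
df["first_digit"] = df["population"].astype(str).str[0]

# Variant: the first non-zero digit, which also copes with values
# such as "-12" or "0.5" if they were present in the column.
df["first_digit_regex"] = (
    df["population"].astype(str).str.extract(r"([1-9])", expand=False)
)

# How often each leading digit occurs, ordered by digit.
counts = df["first_digit"].value_counts().sort_index()
print(counts)
```

Note that the new column holds strings rather than integers; adding .astype(int) would convert it if numeric digits are needed.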