The following was inspired by chapter 5 of A.T. Combs' book, Python Machine Learning Blueprints.
I wanted a system that would predict, using a support vector machine (SVM), whether an article in an RSS feed might be something I'm interested in and want to read. For convenience I chose the BBC RSS feed, but the idea should work with any RSS feed.
SVM is supervised learning, so it requires a training dataset. The first step, then, is to create that training dataset.
The following script fetches some data from the BBC RSS feed and loads it into an SQLite table.
import datetime
import sqlite3 as lite

import feedparser
from bs4 import BeautifulSoup

def clean_text(text):
    # strip any HTML markup from the feed summary
    soup = BeautifulSoup(text, 'lxml')
    return soup.get_text()

#feed = feedparser.parse('http://seclists.org/rss/fulldisclosure.rss')
feed = feedparser.parse('http://feeds.bbci.co.uk/news/business/rss.xml')
con = lite.connect("news.db")
cur = con.cursor()
n = 0
for post in feed.entries:
    title_txt = post.title
    summary_txt = clean_text(post.summary)
    link_txt = post.link
    ts = datetime.datetime.now()
    cur.execute('INSERT INTO news(title, summary, link, timestamp) VALUES(?,?,?,?)',
                (title_txt, summary_txt, link_txt, ts))
    n += 1
    print('inserted row: ' + str(n))
con.commit()
con.close()
The above script populates the news table with one row per item in the feed.
I added a final column, 'want', when creating the table, and updated it manually: 'y' for something that interests me, 'n' for anything else.
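The CREATE TABLE script itself isn't reproduced here, but judging by the columns used in the INSERT above plus the extra 'want' column, it would have looked something like this (shown against an in-memory database so the snippet runs standalone; the original used 'news.db'):

```python
import sqlite3 as lite

con = lite.connect(':memory:')  # the original script would use 'news.db'
cur = con.cursor()
cur.execute('''CREATE TABLE news (
    title     TEXT,
    summary   TEXT,
    link      TEXT,
    timestamp TEXT,
    want      TEXT   -- filled in manually later: 'y' or 'n'
)''')
con.commit()
```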
Once I had some data in the table I exported it as a CSV file. I then split that file in two: one part is the training set, with the 'want' column populated; the other is the test dataset, with the 'want' values removed. I'm working with text data, but SVM needs numeric features, so I used the TF-IDF vectorizer.
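The split itself can be done in pandas rather than by hand. This is a sketch, not the original step: the tiny inline DataFrame stands in for the exported CSV (the real table had around 50 labelled rows), and the 75/25 ratio is an assumption.

```python
import pandas as pd

# Stand-in for the exported table; the real data came from pd.read_csv().
df = pd.DataFrame({
    'summary': ['markets rise', 'rates cut', 'new phone launch', 'oil price falls'],
    'want':    ['y', 'n', 'y', 'n'],
})

# Hold back roughly a quarter of the rows as the test set.
train = df.sample(frac=0.75, random_state=1)
# The test set gets its labels stripped, matching the manual split described above.
test = df.drop(train.index).drop(columns=['want'])
```

The two frames would then be written out with to_csv() as the training and test files used below.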
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Build TF-IDF features from the training summaries and fit the classifier
df = pd.read_csv('newsfeed_train.csv')
vect = TfidfVectorizer(ngram_range=(1, 3), stop_words='english')
tv = vect.fit_transform(df['summary'])
clf = LinearSVC()
clf.fit(tv, df['want'])

# Vectorize the test summaries with the same fitted vocabulary, then predict
df_test = pd.read_csv('newsfeed_test.csv')
tv_test = vect.transform(df_test['summary'])
results = clf.predict(tv_test)
print(results)
The above script reads in the training dataset, creates the TF-IDF vectorized data, builds a model from it, then applies the model to the test dataset, which contains 24 rows. Running the script prints a list of 24 'y's and 'n's. Since I know which values I put in the 'want' column for those rows, I can compare the predicted values to my labels. The result: only 42% correct. A simple script that randomly assigned 'y' or 'n' could do as well or better. So what went wrong? The problem was most likely the size of the training dataset, about 50 rows; hundreds or thousands of rows would be needed for a better result. With more data and some fine-tuning it may be possible to get a much better outcome.
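The comparison step can be automated with scikit-learn's accuracy_score. The labels below are made-up stand-ins for the real 24-row test set, just to show the call:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: what I recorded by hand (y_true) versus
# what the model predicted (y_pred) for five sample rows.
y_true = ['y', 'n', 'n', 'y', 'n']
y_pred = ['n', 'n', 'y', 'y', 'n']

print(accuracy_score(y_true, y_pred))  # 0.6 -- three of the five labels match
```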