I'm working on a project that takes two text files and compares their contents to see whether they cover the same subject matter. The subject matter will be security vulnerability posts. The script below is a quick version 1: it removes stop words, then pulls the ten most common words from each file so the two lists can be compared visually.
import string

import nltk
from nltk.corpus import stopwords

# Read the first file, strip punctuation, tokenise, and drop stop words
with open('input_file_one.txt') as inputfile:
    text = inputfile.read()
text = text.translate(str.maketrans('', '', string.punctuation))
tokens = nltk.word_tokenize(text)
content = [w for w in tokens if w.lower() not in stopwords.words('english')]
fdist = nltk.FreqDist(content)

# Repeat for the second file
with open('input_file_two.txt') as inputfile:
    text2 = inputfile.read()
text2 = text2.translate(str.maketrans('', '', string.punctuation))
tokens2 = nltk.word_tokenize(text2)
content2 = [w for w in tokens2 if w.lower() not in stopwords.words('english')]
fdist2 = nltk.FreqDist(content2)

# Top ten most common words from each file, for visual comparison
print(fdist.most_common(10))
print(fdist2.most_common(10))
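A possible next step beyond comparing the lists by eye would be to score the overlap automatically. The sketch below is one hypothetical approach (not part of the script above): it takes the Jaccard overlap of the top-n words of two token lists, using only the standard library's Counter so the idea is easy to test in isolation. The function name and the toy posts are invented for illustration.

```python
from collections import Counter

def top_words_overlap(words_a, words_b, n=10):
    """Jaccard overlap between the n most common words of two token lists."""
    top_a = {w for w, _ in Counter(words_a).most_common(n)}
    top_b = {w for w, _ in Counter(words_b).most_common(n)}
    return len(top_a & top_b) / len(top_a | top_b)

# Toy example: two short "posts" sharing some vulnerability vocabulary
post1 = "buffer overflow exploit patch overflow exploit".split()
post2 = "heap overflow exploit mitigation patch patch".split()
print(top_words_overlap(post1, post2))  # → 0.5
```

A score near 1 would suggest the two posts draw on the same vocabulary; near 0, different subjects. The same idea applies directly to the `content` and `content2` lists produced by the script.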