The previous post described the first stages of an approach that isolates key words from each text and then compares the two lists. That approach is very limited: two completely different texts can share similar key words, and two similar texts can use different words.
This second approach uses scikit-learn rather than NLTK. The code:
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the two documents to compare
with open('input_file_one.txt') as if1:
    txt1 = if1.read()
with open('input_file_two.txt') as if2:
    txt2 = if2.read()

documents = [txt1, txt2]

# Build tf-idf vectors; multiplying the (L2-normalised) matrix by its
# transpose gives the pairwise cosine similarities
tfidf = TfidfVectorizer().fit_transform(documents)
pairwise_similarity = tfidf * tfidf.T
print(pairwise_similarity)
The output from this is:
(0, 1) 0.495086884274
(0, 0) 1.0
(1, 0) 0.495086884274
(1, 1) 1.0
Each document is perfectly similar to itself, so the diagonal entries (0, 0) and (1, 1) are 1.0; the off-diagonal entries show that these two texts have a cosine similarity of roughly 0.495. A score of 1.0 would mean the two texts are identical.
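The printed result is a sparse 2x2 matrix, so to use the score in a script you can convert it to a dense array and read an off-diagonal entry. A minimal sketch, with two short toy strings standing in for the input files:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for the two .txt files
documents = ["the quick brown fox", "the quick red fox"]

tfidf = TfidfVectorizer().fit_transform(documents)
pairwise_similarity = tfidf * tfidf.T

# Convert the sparse matrix to a dense array and read the
# off-diagonal entry: the cosine similarity of doc 0 and doc 1
score = pairwise_similarity.toarray()[0, 1]
print(round(score, 3))
```

The matrix is symmetric, so [0, 1] and [1, 0] hold the same value.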
When I use the script from the previous post with these two .txt files, I get the following two lists of words:
[('could', 2), ('remote', 2), ('zendmail', 2), ('vulnerability', 2), ('web', 2), ('forms', 2), ('target', 2), ('Desc', 1), ('execution', 1), ('context', 1), ('application', 1), ('others', 1), ('code', 1), ('registration', 1), ('unauthenticated', 1)]
[('zendmail', 5), ('email', 5), ('using', 4), ('vulnerability', 4), ('address', 3), ('update', 3), ('local', 3), ('provided', 3), ('Framework', 3), ('Zend', 3), ('command', 2), ('sendmail', 2), ('line', 2), ('additional', 2), ('attack', 2)]
tf–idf stands for term frequency–inverse document frequency. It is a numerical statistic that reflects how important a word is to a document in a corpus, and it is often used as a weighting factor in information retrieval and text mining.