The following script takes a document and compares it to a set of documents to find which are most similar to it.
import gensim
from nltk.tokenize import word_tokenize
raw_documents = ["An insecure file permissions vulnerability has been discovered in the official Icecream Ebook Reader v4.53 software. The vulnerability allows local attackers with system user accounts to elevate the access to higher system privileges.",
"A persistent cross site scripting web vulnerability has been discovered in the official Zenario v7.6 content management system.",
"The sensor communicates with the application processor via I2C bus #1, which also provides a firmware update interface.",
"Due to the lack of URI schemes validation any external URI scheme can be invoked by the Microsoft OneDrive iOS application with out any user interaction.",
"While performing network level testing of various Google applications, we discovered that the content for the application did not use SSL."]
gen_docs = [[w.lower() for w in word_tokenize(text)] for text in raw_documents]
dictionary = gensim.corpora.Dictionary(gen_docs)
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
tf_idf = gensim.models.TfidfModel(corpus)
sims = gensim.similarities.Similarity('similarDocs', tf_idf[corpus], num_features = len(dictionary))
query_doc = [w.lower() for w in word_tokenize("The default installation directory for Icecream Ebook Reader is Icecream Ebook Reader with weak folder permissions that grants EVERYONE modify privileges to the contents of the directory and its subfolders. This allows an attacker opportunity for their own code execution under any other user running the application.")]
query_doc_bow = dictionary.doc2bow(query_doc)
query_doc_tf_idf = tf_idf[query_doc_bow]
print(sims[query_doc_tf_idf])
Output of the script:
[ 0.48266575 0. 0.01086096 0.13409612 0.17690402]
So the comparison document most closely matches the first document, with a similarity score of about 0.48. The least similar is the second document, with a score of zero.
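To turn the printed score array back into a ranking of the source documents, the indices can be sorted by score. A minimal sketch with NumPy, using the scores printed above (the variable names here are illustrative, not part of the original script):

```python
import numpy as np

# Similarity scores as printed by the script above.
scores = np.array([0.48266575, 0.0, 0.01086096, 0.13409612, 0.17690402])

# Indices of the documents, ordered from most to least similar.
ranked = np.argsort(scores)[::-1]
best = int(ranked[0])  # index of the closest match in raw_documents
```

Here `best` is 0, confirming that the query about Icecream Ebook Reader matches the first document in the list.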