It is perhaps reasonable to assume that the language used in presidential inaugural speeches differs from the language used in personal ads. NLTK offers some easy-to-code ways of measuring those differences.
from nltk.book import *
text = [w.lower() for w in text4 if w.isalpha()]  # keep alphabetic tokens, lower-cased
unique_text = set(text)                           # remove duplicate words
fdist = FreqDist([len(w) for w in unique_text])   # frequency distribution of word lengths
fdist.plot()                                      # draw the line plot of the distribution
The code begins by selecting the alphabetic tokens and lower-casing them; this removes punctuation from the analysis and ensures that tokens like 'The' and 'the' are not treated as separate words. The set function then removes duplicates, and a frequency distribution is built over the lengths of the remaining tokens. The code produces the following graph.
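The same pipeline can be seen in miniature on a hand-made token list (the tokens below are invented for illustration; the real script runs over text4):

from nltk import FreqDist

# Invented sample tokens -- stand-ins for a real corpus
tokens = ["The", "quick", ",", "brown", "fox", "jumps", "the", "Fox", "!"]
words = [w.lower() for w in tokens if w.isalpha()]   # drops ',' and '!'
unique_words = set(words)                            # 'the'/'The' and 'fox'/'Fox' collapse
fdist = FreqDist([len(w) for w in unique_words])     # counts of each word length

# unique_words is {'the', 'quick', 'brown', 'fox', 'jumps'}
# so fdist records two words of length 3 and three of length 5

Because FreqDist is just a counter over whatever iterable it is given, feeding it word lengths rather than the words themselves is all it takes to switch from a vocabulary count to a word-length profile.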
By comparison, applying the same code to the personal ads text gives the following graph:
The obvious difference between the two graphs is that the inaugural speeches contain more long words: words of length 7 are the most common, whereas in the personal ads the most common word length is 4 and no words are longer than 14 letters.
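Both of those observations can be read straight off the distribution without plotting. A minimal sketch, using hypothetical word-length samples rather than the real corpus counts:

from nltk import FreqDist

# Hypothetical word-length samples (for illustration only)
lengths = [4, 4, 4, 4, 7, 7, 9, 12, 14]
fdist = FreqDist(lengths)

most_common_length = fdist.max()   # the length with the highest count
longest = max(fdist)               # the largest length that occurs at all

On the real texts, fdist.max() returns 7 for the inaugural speeches and 4 for the personal ads, and max(fdist) gives the longest word length seen in each corpus.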