The following two scripts convert .docx files to .txt files, and either one can be adapted to mine data from .docx files. The first script uses the python-docx library, which can be installed with pip install python-docx. The library's documentation includes example code.
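from docx import Document

document = Document('CV.docx')
# Append each paragraph's text to CV.txt, one paragraph per line.
# In Python 3 the file is opened as UTF-8 text, so p.text is written directly
# instead of being encoded to bytes first.
with open('CV.txt', 'a', encoding='utf-8') as opFile:
    for p in document.paragraphs:
        opFile.write(p.text + '\n')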
The second approach is based on the example given in 'Web Scraping with Python' by Ryan Mitchell, published by O'Reilly. It reads the XML file inside the .docx archive that stores the document text, and uses Beautiful Soup to parse that XML. This can give you more control than the first approach, because you can use the BeautifulSoup object's search methods to find just the sections of the file you need.
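from zipfile import ZipFile
from bs4 import BeautifulSoup

# A .docx file is a zip archive; the body text is stored in word/document.xml
with ZipFile('CV.docx') as document:
    xml_content = document.read('word/document.xml')
wordObj = BeautifulSoup(xml_content.decode('utf-8'), 'lxml')
# Each w:t element holds one run of text
textStrings = wordObj.find_all('w:t')
with open('CV2.txt', 'a', encoding='utf-8') as opFile:
    for textElem in textStrings:
        opFile.write(textElem.text + '\n')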
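One thing to keep in mind with the XML approach is that each w:t element is a run of text rather than a whole paragraph, so a paragraph with mixed formatting is split across several elements. Below is a minimal sketch of grouping the runs by their parent w:p element. It uses the standard library's xml.etree.ElementTree rather than Beautiful Soup, purely so that it runs without extra installs. The helper names docx_paragraphs and make_demo_docx and the sample text are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from zipfile import ZipFile

# WordprocessingML namespace behind the w: prefix in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph (w:p) in a .docx file.

    `source` is a path or a binary file-like object. This helper is
    hypothetical and not part of any library.
    """
    with ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Join every text run (w:t) that belongs to this paragraph
        runs = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


def make_demo_docx():
    """Build a tiny in-memory archive holding only word/document.xml (demo data)."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Jane </w:t></w:r><w:r><w:t>Doe</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Python developer</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for line in docx_paragraphs(make_demo_docx()):
        print(line)
```

To try it on a real file, call docx_paragraphs('CV.docx') instead of passing the demo archive. Paragraphs inside tables are included as well, since they are also stored as w:p elements.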