Python offers several libraries that can handle PDF files. The example below uses pdfminer and Python 2 on Windows 10. I found the function at the following location.
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    # Extracted text is collected in an in-memory buffer
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        # Render each page into the text buffer
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text

pdfFile = 'ireland.pdf'
opString = convert_pdf_to_txt(pdfFile)
opFile = open('ireland.txt', 'w')
opFile.write(opString)
opFile.close()
The code converts the PDF file (which I had downloaded and saved) to a text file and saves it in the same directory. It works best with PDFs that are mostly text and are not laid out in multiple columns.
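For anyone on Python 3, here is a minimal sketch of the same conversion. It assumes the pdfminer.six fork is installed (pip install pdfminer.six), which is not what the example above was written against. Its high-level extract_text function stands in for the manual resource manager and interpreter setup. The helper names txt_path_for and pdf_to_txt_file are my own, purely illustrative, and the pdfminer import is deferred so the path helper can be used on its own.

```python
import io
import os


def txt_path_for(pdf_path):
    """Return the .txt path beside the given PDF (same directory, same base name)."""
    base, _ext = os.path.splitext(pdf_path)
    return base + '.txt'


def pdf_to_txt_file(pdf_path):
    """Extract the text of pdf_path and write it to a .txt file beside it."""
    # Assumption: pdfminer.six is installed. The import lives inside the
    # function so txt_path_for() remains usable without the library.
    from pdfminer.high_level import extract_text

    text = extract_text(pdf_path)
    out_path = txt_path_for(pdf_path)
    # Write as UTF-8 so non-ASCII characters in the PDF survive
    with io.open(out_path, 'w', encoding='utf-8') as out_file:
        out_file.write(text)
    return out_path
```

Calling pdf_to_txt_file('ireland.pdf') should then write ireland.txt next to the PDF, though the same caveat about multi-column layouts is likely to apply.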