Nick Sellen

Tools for extracting text from PDFs and Microsoft Word documents

07 Jan 2010

Have you ever opened a PDF or a Microsoft Word document to see what they're really made of?

It looks like total gibberish. You need to know the secret code to get meaningful content out of them.

This post isn't about the formats; I don't have any special knowledge about them anyway. It's a quick roundup of the tools that can be used to extract text from them.

The contenders

xpdf is a suite of tools including a PDF reader and the all-important text extractor, pdftotext. It wouldn't compile straight off on my Mac so I installed the MacPorts version (with its bazillion dependencies). To run it (by default the text is written to EvacuationGuide.txt):

pdftotext EvacuationGuide.pdf

Apache PDFBox is a Java library that performs a range of operations relating to PDFs including text extraction. To run it I added every library in sight to the classpath:

java -cp ant.jar:bcmail-jdk14-136.jar:bcprov-jdk14-136.jar:commons-logging-1.1.1.jar:icu4j-3.8.jar:jempbox-0.8.0-incubating.jar:junit-3.8.2.jar:lucene-core-2.4.1.jar:lucene-demos-2.4.1.jar:pdfbox-0.8.0-incubating.jar:fontbox-0.8.0-incubating.jar org.apache.pdfbox.ExtractText EvacuationGuide.pdf

Apache POI is a well-featured Java API for Microsoft documents - however their Microsoft Word expert appears to have gone elsewhere ("he is working for a company now that signed a NDA with Microsoft"). It didn't come with a command line option so I created an executable jar. The juicy bit of the code is:

String filename = args[0];
FileInputStream fileStream = new FileInputStream(filename);
String text;

// docx is the XML-based format (POI's XWPF classes); anything else is treated as old-style doc (HWPF)
if (filename.endsWith("docx")) {
	text = new XWPFWordExtractor(new XWPFDocument(fileStream)).getText();
} else {
	text = new WordExtractor(new HWPFDocument(fileStream)).getText();
}
fileStream.close();
System.out.println(text);

A bit of googling brought up a quick and dirty script that works with the Word 2007 (docx) format. It didn't quite work on my Mac as-is so I modified it a bit.

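# A docx is a zip archive: unpack it into a scratch directory
unzip -oq "$1" -d /tmp/MS
# Start a new line at every tag, keep the lines beginning with w:t, and take the text after the tag's closing '>'
tr "<" "\012" < /tmp/MS/word/document.xml | grep '^w:t' | cut '-d>' -f2 | uniq > "$1.plain"
rm -r /tmp/MS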

Apache Tika is a handy Java toolkit that wraps up PDFBox, POI, and numerous other products to provide a common interface. It builds with Maven, which failed with a Java heap space error - to fix that:

export MAVEN_OPTS=-Xmx512m

Once it's built, it can be run:

java -jar tika-app-0.5.jar -t EvacuationGuide.docx > EvacuationGuide.tika.docx

So do they work?

I used this Emergency Evacuation Planning Guide as my test document - I made PDF and docx versions using Office 2008.

Well, I'm not going to write up a painstakingly detailed analysis of the results. Just a few observations.

The quick and dirty script, whilst getting all the text out, was rubbish. There were a lot of words with line breaks in the middle of them. This is no good.
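The mid-word breaks are probably down to how Word stores text: a paragraph (w:p) is split into runs, each with its own w:t element, and a run boundary can fall in the middle of a word. The script puts every w:t on its own line, so each boundary turns into a line break. Here is a minimal sketch of a paragraph-aware alternative, using only the Java standard library. The class and method names (DocxParagraphs, fromXml, fromDocx) are made up for illustration, and it ignores tabs, explicit line breaks, and nested content such as text boxes.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of POI or Tika.
final class DocxParagraphs {
    // Namespace of the w: prefix in word/document.xml
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns one string per w:p element, with the text of all its w:t runs joined together. */
    static List<String> fromXml(InputStream documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(documentXml);

        List<String> paragraphs = new ArrayList<>();
        NodeList ps = doc.getElementsByTagNameNS(W, "p");
        for (int i = 0; i < ps.getLength(); i++) {
            NodeList ts = ((Element) ps.item(i)).getElementsByTagNameNS(W, "t");
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < ts.getLength(); j++) {
                // No separator: runs are joined exactly as they appear in the paragraph
                sb.append(ts.item(j).getTextContent());
            }
            paragraphs.add(sb.toString());
        }
        return paragraphs;
    }

    /** Opens the docx as a zip archive and reads word/document.xml from it. */
    static List<String> fromDocx(String path) throws Exception {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IllegalArgumentException("No word/document.xml in " + path);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return fromXml(in);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        for (String paragraph : fromDocx(args[0])) {
            System.out.println(paragraph);
        }
    }
}
```

Run against a docx, this prints one line per paragraph rather than one line per run, so a word split across runs stays in one piece.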

xpdf does better PDF conversion than PDFBox:

  1. xpdf has nice paragraphs, PDFBox has unneeded line breaks
  2. PDFBox has some kind of character encoding problem which results in O's with lines above them instead of fancy double quotes - in fact xpdf was the only one to cope with them in the "test"
  3. xpdf has nicely spaced and separated paragraphs to PDFBox's crunched up ones
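One guess at where the O's come from, which nothing in the test confirms: in the MacRoman encoding the opening and closing curly quotes are the bytes 0xD2 and 0xD3, and those same bytes read as ISO-8859-1 (Latin-1) are Ò and Ó. If the extractor wrote its output in MacRoman and the file was then viewed as Latin-1, that is exactly what would show up. The sketch below demonstrates the mix-up and the usual way around it, which is to encode as UTF-8 explicitly. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox or POI.
final class QuoteMojibake {
    /** Decodes raw bytes as ISO-8859-1, the way a Latin-1 viewer would display them. */
    static String viewAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Encodes text as UTF-8, which can represent curly quotes regardless of platform default. */
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xD2 and 0xD3 are the MacRoman bytes for the curly quotes U+201C and U+201D
        byte[] macRomanQuoted = {(byte) 0xD2, 'h', 'i', (byte) 0xD3};
        System.out.println(viewAsLatin1(macRomanQuoted));

        // Round-tripping through an explicit UTF-8 encoding keeps the quotes intact
        String quoted = "\u201Chi\u201D";
        String roundTripped = new String(toUtf8(quoted), StandardCharsets.UTF_8);
        System.out.println(quoted.equals(roundTripped));
    }
}
```

If that guess is right, telling Java which encoding to write (or setting file.encoding when launching the tool) would be the thing to try before blaming the extractor itself.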

Both the POI conversions (doc and docx) came out kinda messy, with extra stuff to denote things they can't represent in plain text (e.g. MERGEFORMATINET or INCLUDEPICTURE) and not so nicely separated paragraphs. They also had encoding problems with the fancy double quotes.

The Tika conversions were, as you'd expect, identical to the PDFBox and POI conversions. Tika does provide a very useful service by wrapping them up though, and it offers a lot more formats than just PDF and Word.

Overall though

All of them (except the quick and dirty script) perform well enough to extract meaningful text and so should be suitable for indexing purposes.

I intend to use Apache Tika for its ease of use.