Friday, 12 September 2008

Generating PDF from XML with XSL-FO

One of the healthcare solutions I'm working on has to generate diagnostic report documents in a form we can distribute directly (e.g. by secure email) to clinical staff. The diagnostic report text, plus the usual patient demographics and such, arrives in a custom XML message from the integration engine. The solution uses HL7 v2 elsewhere, but the PAS messaging is exclusively via custom XML (not HL7 v3 XML).

The legacy document-generation subsystem used Word to create the document: a template was created to get the layout and formatting right, and simple placeholders were used to mark where the data taken from the XML message would be substituted. This approach actually loads Word into memory (on the server!) to do the work. It's slow, memory-hungry and just generally clumsy and ugly. Plus you end up with a Word document, so the clinician needs Word (or the reader) to view it.

PDF is a more acceptable format, in my view. We can secure and digitally sign the document when we generate it to prevent subsequent changes. Recipients can view PDF on any platform with a free viewer. The problem for me was how to generate the PDF programmatically, from the XML data. There are probably several ways to do this, but I chose XSL-FO and the Apache FOP project mainly because I wanted to avoid using a proprietary PDF generator product (there are lots out there), but also since XSL-FO can do more than just generate PDF.

First problem: how do you create the FO which is the 'shape' of the generated document? Of course, you could simply read through all the manuals and write one from scratch. Well, I'm just plain lazy, you see, and I don't want to do all that. I want to take my nice OpenOffice document, or even Word document, and have a tool create an XSL-FO for that document. And I'd like the tool to be free (there are commercial tools of course, but I'm cheap). Does such a thing exist?

It does! Amazingly, precisely this facility exists in Abiword: not my favourite word-processor by a long, long way, but a good solution for this particular problem. OpenOffice should be really good at doing this, as it stores documents natively as XML and already uses FO internally for some style information. But, despite some promising hints, there is no mature support for this. This is a real shame: this is just the sort of thing OOo should be capable of, especially as it's apparently half way there already.

Here's what I managed to find on the OOo site:
It's also important to mention that Microsoft has an XSLT which you can apply to Word documents to generate XSL-FO. It's freely downloadable from this download page. I tried this, and it works, but the resulting FO is much messier than Abiword's.

Once you have the FO, the obvious step is to embed it in an XSL, add xsl:value-of elements in the appropriate places and use a transform to populate the template. This is the approach I took for the proof-of-concept and it worked well. The resulting PDF looks almost right - with a small amount of FO-tweaking, we should have something very usable.

But using XSL means loading and running the (trivial) transform, which I think may be very inefficient for such a simple case, and it also requires the FO to be edited. I've decided to use a simpler approach (using StringTemplate) which I hope will be more efficient and which requires less FO editing (just the addition of $fieldName$ placeholders). All we need to drive this template is a list of (fieldname, XPath) pairs for our XML message. Of course, most other applications will need the power of XSL (e.g. to deal with tables of entries): I'm only avoiding it here because the data is so simple.
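To make the placeholder idea concrete, here is a minimal sketch in Python. It is not StringTemplate (which is a Java library), just a standard-library stand-in for the same idea: a map of (fieldname, XPath) pairs is evaluated against the XML message, and each $fieldName$ placeholder in the FO is replaced with the escaped value. The message layout, element names, field names and FO fragment are all invented for illustration; they are not the real PAS message format.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical XML message; the real message format is not shown in this post.
MESSAGE = """<report>
  <patient><name>Jane Doe</name><dob>1970-01-01</dob></patient>
  <diagnosis>No abnormality detected.</diagnosis>
</report>"""

# Hypothetical (fieldname, XPath) pairs, relative to the message root.
# ElementTree only supports a limited subset of XPath.
FIELDS = {
    "patientName": "./patient/name",
    "dateOfBirth": "./patient/dob",
    "reportText": "./diagnosis",
}

# Hypothetical FO fragment with $fieldName$ placeholders added by hand.
FO_TEMPLATE = (
    '<fo:block xmlns:fo="http://www.w3.org/1999/XSL/Format">'
    "$patientName$ ($dateOfBirth$): $reportText$"
    "</fo:block>"
)

PLACEHOLDER = re.compile(r"\$(\w+)\$")


def extract_fields(message_xml, field_paths):
    """Evaluate each path against the message; missing elements become ''."""
    root = ET.fromstring(message_xml)
    return {
        name: root.findtext(path, default="")
        for name, path in field_paths.items()
    }


def fill_template(template, values):
    """Replace each $name$ with its XML-escaped value.

    Raises KeyError if the template names a field that has no value.
    """
    return PLACEHOLDER.sub(lambda m: escape(values[m.group(1)]), template)


if __name__ == "__main__":
    print(fill_template(FO_TEMPLATE, extract_fields(MESSAGE, FIELDS)))
```

Escaping the values matters because the report text is free text: a stray < or & would otherwise leave the FO malformed before it ever reaches the formatter.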

This is something we're bound to want to do again, in different contexts, so I'm using this project and the prototype to build a tool-chain and utilities for this, so we can use the same approach more easily next time.
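As one possible piece of such a tool-chain, the last step (turning a finished FO file into a PDF) could be wrapped roughly as below. This is only a sketch: it assumes the Apache FOP command-line launcher is installed and available on the PATH as fop, and the helper names and file names are hypothetical.

```python
import shutil
import subprocess


def build_fop_command(fo_path, pdf_path, executable="fop"):
    """Build the FOP command line: fop -fo <input.fo> -pdf <output.pdf>."""
    return [executable, "-fo", fo_path, "-pdf", pdf_path]


def render_pdf(fo_path, pdf_path):
    """Run FOP on an FO file; assumes the 'fop' launcher is on the PATH."""
    command = build_fop_command(fo_path, pdf_path)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("Apache FOP launcher 'fop' not found on PATH")
    # check=True raises CalledProcessError if FOP exits with an error.
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_fop_command("report.fo", "report.pdf")))
```

Shelling out starts a new JVM for every document, so for anything beyond a prototype it would probably be better to call FOP from within a long-running Java process instead.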


  1. Roger,

    We're currently looking at a similar project, and I have some tips. Firstly, if you Google ooo2xslfo you'll find a handy jar file from System Concept which you can add to OpenOffice as an XML filter via Tools->XML Filter Settings. Once that's done, you can use File->Export to generate your XSL-FO. We tested this against Abiword (which I liked too) and MS Word, and found that OpenOffice produced the best results when used with Apache Cocoon to do the transformation (Abiword came a very close second).

    We're now at the stage where we're looking at repeating blocks of data, and have taken a similar placeholder approach to adding fields and eventually xsl:value-of elements. I'd be really interested to hear if you get any further with your project. I'd like to get the solution to the point where it can handle images, charts, etc. from a single click within OpenOffice.


  2. Hi Roger/Matt,

    I am trying to use the XSL-FO filter settings to export .doc/.odt files as .fo files. I am using OpenOffice 3.2 on Ubuntu LTS. I keep getting a write error: the file could not be written. Am I missing any steps in the process? Have you had any success with this approach? Could you post the steps in a bit more detail? I appreciate your time.

    -Mahesh M