Friday, 26 January 2007

Formatting XML using Python

Update: This article contains formatting errors, caused, I think, by the WordPress editor. In future, I'll write articles in the wiki and simply link to them from here. This article appears on this page in the wiki, with better code markup.

My current project involves a fair bit of work with XML Schema and the instance documents validated against those schemas. The instance documents are generated by one application and destined for consumption by another; the resulting XML is, unsurprisingly, not formatted for reading by humans. As usual with XML, you end up having to look at it occasionally in Notepad (or EditPlus, my favourite Windows workhorse editor), and then of course you want to see the document structure nicely indented (which, by the way, means that the newest EditPlus release will automatically create folds for you - very nice).

So, you locate or create a nice little script to pretty-print the XML. Perhaps the most obvious way to do this is using the identity transform, in XSLT. But as I had been writing little Python scripts to generate and transform XML on this project, I decided to write a tiny Python function to 'tidy' XML files. I've attached a complete script to this post, but the lines which actually do the work are:

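from xml.dom import minidom
dom = minidom.parse(inFile)  # inFile: path of the XML file to tidy
with open(outFile, "w") as out:  # the output file is closed on exit
    dom.writexml(out, addindent=" ", newl="\n")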

The writexml method of the minidom document object is where the 'pretty printing' actually happens. If you run this script against an XML file, it will appear to work - the resulting XML is indented and formatted. However, there is a 'gotcha', which is the real point of this post.

If your XML is validated against a schema, and the schema contains an enumerated type, the elements of that type in the nicely formatted document are no longer schema-valid! Here's an example schema fragment:

<xs:simpleType name="MessageType">
  <xs:restriction base="xs:string">
    <xs:enumeration value="REF_INC"/>
    <xs:enumeration value="REF_TRI"/>
    <xs:enumeration value="REF_REJ"/>
    <xs:enumeration value="REF_ACC"/>
  </xs:restriction>
</xs:simpleType>

And here's a fragment of XML from a document instance:

<Header><MessageID>20070125152405435</MessageID><MessageType>REF_ACC</MessageType>
<MessageTypeVersion>0.5</MessageTypeVersion><Destinations>
<Destination>PRC</Destination></Destinations>
... etc.

Here's what writexml produced:

<Header>
 <MessageID>
  20070125152405435
 </MessageID>
 <MessageType>
  REF_ACC
 </MessageType>
 <MessageTypeVersion>
  0.5
 </MessageTypeVersion>

Indented, and with newlines between elements. Unfortunately, newlines and indentation are also inserted into the text values. In this example, MessageType is no longer schema-valid: the white space becomes part of the element's value, and because the type restricts xs:string, which preserves white space, the padded value no longer matches the enumeration value REF_ACC. This happens, of course, because the text values are child nodes of the MessageType element node, and writexml indents them like any other child. The documentation doesn't appear to offer much help, and experimenting with writexml arguments didn't result in anything better.
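
To make the mismatch concrete, here is a minimal sketch that compares the text node under MessageType with the enumeration values. It is not a schema validator: MESSAGE_TYPES and message_type_text are names made up for this illustration, and the set membership test merely stands in for the exact comparison that an xs:string enumeration implies.

```python
from xml.dom import minidom

# Enumeration values copied from the MessageType schema fragment.
MESSAGE_TYPES = {"REF_INC", "REF_TRI", "REF_REJ", "REF_ACC"}


def message_type_text(xml_text):
    """Return the raw text under the first MessageType element."""
    dom = minidom.parseString(xml_text)
    element = dom.getElementsByTagName("MessageType")[0]
    # The value lives in a child text node, white space and all.
    return element.firstChild.data


compact = "<Header><MessageType>REF_ACC</MessageType></Header>"
indented = "<Header>\n <MessageType>\n  REF_ACC\n </MessageType>\n</Header>"

print(message_type_text(compact) in MESSAGE_TYPES)   # True
print(message_type_text(indented) in MESSAGE_TYPES)  # False
```

Stripping the value before the comparison would make the second check pass, but a validating parser does no such stripping for a type based on xs:string.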

There's also a toprettyxml method:

prettydoc = dom.toprettyxml(indent=" ", newl="\n")
with open(outFile, "w") as fp:  # the output file is closed on exit
    fp.write(prettydoc)

This is no better. In fact, I seem to end up with multiple, redundant newlines in the output. This is getting silly! All I want is nicely formatted XML, for goodness' sake! Have I missed something here? This is just not worth this much effort - the XSLT works fine.
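
One possible workaround, offered as a sketch rather than as part of the attached script: the redundant newlines come from the white-space-only text nodes that already sit between elements in the input, so they can be removed before pretty-printing. strip_whitespace_nodes and tidy are hypothetical helper names. The sketch also assumes a recent Python 3, where minidom keeps an element whose only child is a text node on a single line; older releases, such as the one used for this post, still wrap those values.

```python
from xml.dom import minidom


def strip_whitespace_nodes(node):
    """Recursively drop text nodes that hold nothing but white space."""
    for child in list(node.childNodes):  # copy: the list is mutated below
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_nodes(child)


def tidy(xml_text):
    """Return an indented copy of xml_text (hypothetical helper)."""
    dom = minidom.parseString(xml_text)
    strip_whitespace_nodes(dom)
    return dom.toprettyxml(indent=" ", newl="\n")


# Sample modelled on the instance fragment in this post.
sample = (
    "<Header><MessageID>20070125152405435</MessageID>"
    "<MessageType>REF_ACC</MessageType>\n"
    "<MessageTypeVersion>0.5</MessageTypeVersion></Header>"
)
print(tidy(sample))
```

This suits data-oriented documents like the one above. It is not safe for mixed content, where white space between inline elements can be significant.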

Turns out I'm in good company - while Googling for enlightenment on the dom methods, I came across this post by Bruce Eckel. He went a good deal further than me and wrote what looks like a proper solution (though I admit I haven't checked).

There are two lessons here: (1) although I have a fondness for Python, and I do use it for scripting tasks, many corners of the library are frustratingly badly implemented and/or documented (especially the latter), and (2) you really need to understand the XML model when you're working with XML Schema. I recommend reading Elliotte Rusty Harold's book, Effective XML: see Item 10 (White Space Matters).