Monday, 10 December 2007

From C# to Java - syntax, libraries, generics...

I was disappointed to discover that the standard Java regular expression library (java.util.regex) doesn't support named captures. I'm porting some C# code, which uses this feature, over to Java. The best alternative seems to be JRegex, which forces me to depend on a non-standard library.
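
To make the gap concrete, here's a minimal sketch (my own example, not the ported code) of the fallback: where C# lets you write (?&lt;year&gt;...) and read m.Groups["year"], Java 6's java.util.regex only offers numbered groups.

```java
// C# equivalent: Regex.Match(s, @"(?<year>\d{4})-(?<month>\d{2})").Groups["year"]
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberedGroups {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(\\d{4})-(\\d{2})");
        Matcher m = p.matcher("2007-12");
        if (m.matches()) {
            // group(1) stands in for C#'s Groups["year"], group(2) for Groups["month"]
            System.out.println("year=" + m.group(1) + ", month=" + m.group(2));
            // prints: year=2007, month=12
        }
    }
}
```

Workable, but the positional indices are fragile: reorder the pattern and every call site silently breaks, which is exactly what named captures were protecting me from.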

A porting exercise like this really highlights the differences between Java and C#. I must confess that I prefer C#'s syntax and the slightly cleaner code resulting from the (mostly) very well designed and implemented set of framework libraries. The C# designers of course had two great advantages: (1) being second, and (2) not having to worry over-much about portability.

Advantage (1) is about having a 'clean slate': they could take Microsoft's Java (J++), fix (or improve) awkwardnesses, and clean up the syntax without worrying about compatibility with existing code. Property syntax is a good example; when porting over my C# code, it was a pain having to convert every property to a getXXX() / setXXX(val) method pair. On the other hand, the Java bean convention has a modest advantage when using intellisense or similar in an editor: all the get and set methods appear in a nice list - you don't have to hunt for the properties.
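
As an illustration of the conversion (the Person class here is invented for the example, not taken from my port):

```java
// C# original:
//   public string Name { get { return m_name; } set { m_name = value; } }
// Java bean convention after porting:
public class Person {
    private String m_name;

    public String getName() { return m_name; }
    public void setName(String value) { m_name = value; }

    public static void main(String[] args) {
        Person p = new Person();
        p.setName("Reggie");             // C#: p.Name = "Reggie";
        System.out.println(p.getName()); // C#: Console.WriteLine(p.Name); prints "Reggie"
    }
}
```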

Advantage (2) may be more debatable. Although it's true that the Java platform APIs are careful to avoid OS-dependent behaviour, it's sometimes necessary to be aware of potential differences in the platform beneath the VM: e.g. on a Unix-like system, a file may return false for both File::isDirectory() and File::isFile() (e.g. a block-device). To be fair to .NET, the core libraries provide good support wherever they touch a common native resource such as the filesystem.
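
A small sketch of the sort of check this forces on you (the classify helper is my own invention):

```java
import java.io.File;

public class FileKind {
    // On a Unix-like system a special file (e.g. a block device like /dev/sda)
    // can return false from BOTH isDirectory() and isFile(), so a two-way
    // test silently misclassifies it.
    static String classify(File f) {
        if (!f.exists()) return "missing";
        if (f.isDirectory()) return "directory";
        if (f.isFile()) return "regular file";
        return "special (device, FIFO, socket, ...)";
    }

    public static void main(String[] args) {
        System.out.println(classify(new File(".")));        // "directory"
        System.out.println(classify(new File("/dev/sda"))); // "special ..." on Linux, "missing" elsewhere
    }
}
```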

Two simple examples where the .NET framework libraries provide an out-of-the-box solution, but Java still depends on third-party support:

1) Intelligently combine path fragments to yield a single, usable pathname, using the correct separator character:
// C#
string filePath = Path.Combine(pathFrag1, pathFrag2); // fragments are strings

// Java
import org.apache.commons.io.FilenameUtils; // Requires Apache Commons IO library
...
String filePath = FilenameUtils.concat(pathFrag1, pathFrag2);

2) Read the contents of a file into a string (and ensure the file is closed):
// C#
string fileContents = File.ReadAllText(filePath);

// Java
import org.apache.commons.io.FileUtils; // Requires Apache Commons IO library
...
String fileContents = FileUtils.readFileToString(new File(filePath));

Hardly a big deal in either case, but in C# you never have to hunt for the external library and ensure it's linked. So much of the .NET BCL becomes practically 'extended syntax': if only Visual Studio implemented the equivalent of NetBeans' 'fix imports'! I do get tired of having to go to the top of the file and add the 'using ...' lines.

Generics are another matter (and much too large a subject to deal with properly here). It is a simple truth that .NET does generics properly and Java does not. If you don't like that statement, read both this article and this one, before giving me any grief about it.

Here is the same combination of generic collection and property getter, in both languages:
// C#
//
List<Segment> m_segments = new List<Segment>();
...
public Segment[] Segments
{
get { return m_segments.ToArray(); }
}

// Java
//
ArrayList<Segment> m_segments = new ArrayList<Segment>();
...
public Segment[] getSegments()
{
return m_segments.toArray(new Segment[m_segments.size()]);
}

Again, not a huge difference, but I know which I prefer to look at. Java's toArray requires us to provide the new Segment array as an argument because at runtime ArrayList doesn't retain the generic type information needed to create the array itself. The copying is done with System.arraycopy and casts, because ArrayList's type parameter is erased and replaced with Object (plus casts) at compile-time.
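
You can see the erasure directly: this little demo (not from the ported code) shows that two differently-parameterised ArrayLists share one runtime class, which is exactly why toArray() needs the type hint.

```java
import java.util.ArrayList;
import java.util.List;

public class ErasureDemo {
    public static void main(String[] args) {
        List<String> strings = new ArrayList<String>();
        List<Integer> numbers = new ArrayList<Integer>();
        // After erasure, both are plain ArrayLists; the type parameters are gone:
        System.out.println(strings.getClass() == numbers.getClass()); // prints "true"
        // At runtime the list has no idea it ever held Segments, so toArray()
        // cannot create a Segment[] by itself - hence the new Segment[...] argument.
    }
}
```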

Despite all of that, I'm keen to continue with both languages. If only the Java community would be prepared to go for a 'breaking change', a major step release which fixed some of the fundamental problems (e.g. generics implementation) at the expense of JVM back-compatibility, and cleaned up the syntax even further.

Tuesday, 13 November 2007

EHR: Workflow First

This short piece on the Health Data Management site resonated quite deeply with me: we discovered this truth the hard way, during the implementation of a large-scale radiology project here in the UK. A lot of time was spent on technology (RIS, PACS, PAS, HL7 messaging, DICOM integration, etc.) early on, and a lot of assumptions were made about workflow and the implementation of the operational side of the solution.

WS-*, REST and security

Via Don Box's recent post and a comment in Sam Ruby's reply, I found this presentation: worth reading. The latest version (together with a lot of other stuff the same folk have presented) can be found here.

The main message surely is to make the first gate as secure as you can: SSL + certificates. After that, if you need/want additional security then I think I agree with Don that uniformity, at least at the authentication level, is desirable. I just don't see where vested interests or additional value accrue from doing otherwise.

Wednesday, 24 October 2007

Migrating NetBeans Settings

I've been using both the Beta and nightly-build releases of NetBeans 6 for the last few weeks.  Every time I move to the next version, I have to sort out settings, preferences and libraries because the IDE doesn't offer to do this for me.  Other folk have a similar complaint, e.g.
Migrating Netbeans class libraries between versions - TEERA 2.0

NetBeans preferences are stored in your profile, which in Windows is a folder structure rooted at: C:\Documents and Settings\<yourUserName>\.netbeans\<versionNumber>


The problem with this folder structure is that it mixes IDE settings with things like the local library collection and, worst of all, a local class-information cache and file history (under the var folder). The whole thing can grow to quite a size (mine's well over 80MB).

The key subfolder appears to be config. I managed to migrate all the most important (and time-consuming) settings by copying these folders from one version to the next:

  • Editors
  • Preferences
  • org-netbeans-api-project-libraries

This doesn't quite do everything – docking window positions and toolbar button sizes are obviously stored somewhere else, but I can live with this.


I've voted for this to be fixed: see http://www.netbeans.org/issues/show_bug.cgi?id=42157

Tuesday, 9 October 2007

Ronnie Hazlehurst

So sad to hear that Ronnie Hazlehurst died on 1 October.  There have been plenty of well-written obituaries (e.g. the BBC, the Guardian and the Telegraph), but I wanted to write a few lines myself because for me, like most people of my generation, his music formed an essential part of the backdrop of my early life.  I still find the tunes he wrote for all those wonderful BBC sit-coms hugely evocative: the instant I hear the opening few notes of "The Fall and Rise of Reginald Perrin" I'm transported back...

I hadn't fully appreciated the sheer brilliance of the man until I read some of the obits: for example, he used the Morse code for "Some Mothers Do 'Ave 'Em" to dictate the rhythm of its theme, and then scored the piece for two piccolos! Every piece seemed to fit perfectly the character of the programme for which it was written.  Some might say that we will always make that judgement in retrospect, because there never was any other theme-tune association.  Well, perhaps, but just listen to "The Two Ronnies", "Fall and Rise", and especially "Yes, Minister": these are works of art - you just couldn't improve on them. Those tunes will be forever special to all of us who grew up with them.

Dare Obasanjo on the Release of the Source Code of the .NET Framework Libraries

Link: Dare Obasanjo aka Carnage4Life - On the Release of the Source Code of the .NET Framework Libraries

As usual, someone else has already written it: Dare's piece on this announcement reflects my views exactly.  Since I've been playing with Java lately (see previous post) I've become used to the idea that I can jump straight into the source for almost anything, certainly for the JDK libs.

Still, most of us I'm sure would agree that this is a Good Thing, both in purely practical terms for working programmers, and on another level, more evidence that Microsoft is beginning to 'get' some of the things the other side of the industry managed to 'get' long ago.

There's another way the developer will surely benefit: being able to read through the real code is the best way to appreciate and absorb the good design principles enshrined in (most of) the .NET Framework libraries.  The principles which guided the team are described in the excellent Framework Design Guidelines book by Cwalina and Abrams, a book I'd recommend even to folk who don't use the Microsoft platform.  True, you can use Reflector to reverse the libraries, but the original source will presumably retain comments, which may reveal subtleties around intent, choices and so on.

NetBeans 6.0

After my last post on NB and Eclipse, you might think that I'd never go anywhere near NetBeans again. Well, predictably enough some things (chiefly HL7, Ruby and RDF) have dragged me back to NetBeans and Java, so I grabbed NB 6.0 Beta 1 and gave it another try.

Much, much better. Somehow, the startup time has been reduced quite a bit, everything felt faster and the whole tool is shaping up rather well. A complete contrast to my previous (and quite recent) experience. I played with the startup settings to improve performance even further (details somewhere below) and now I'm quite happy with it.

You do need to spend a bit of time with NB to appreciate just how good it actually is: the code editor features outshine Visual Studio quite easily - better refactoring support, better code navigation being the two I immediately appreciate. Simple example: want to go to a definition? Hold down CTRL and statement elements become hyperlinks. Adding libraries and references is as simple as in VS, and you can create project groups which are similar to VS solutions.  I've barely scratched the surface.

The set of plugins in the default download of NB 6.0 provides a lot of functionality, not all of it really ready for daily use, in my opinion. The UML support appeared good until I tried to use it for a substantial reverse-engineering job: it took a long time and the resulting class diagrams were slow and awkward to render and navigate.  Not really a priority for me, though.

Subversion support is also provided and this is definitely a priority for me.  Sadly, it appears to be weak, too. First, NB appears unable to import new (unversioned) projects: the 'Import into Repository...' command seems to be permanently greyed-out. Oddly, the 'Commit...' command is available even though the project folder is completely unconnected to my SVN repository.  If I invoke that, I get a partial list of new files in the grid, and the option to commit them; clicking the commit button appears to work, but after a while I see a popup dialog saying "Action canceled by user", even though I did nothing!

Ruby support is good, but I'm a novice Ruby developer and have yet to exercise all the Ruby features. The ever-helpful Roman Strobl has provided some good Flash demos of Ruby support: I recommend watching the demo of NB's RoR support, in which Roman builds a bare-bones blog application in a couple of minutes. I haven't found a better Ruby IDE yet.

I plan to put more information on using NetBeans on the Java section of my Wiki, especially for folk coming to Java and NetBeans from a Visual Studio background.  So far I've only added a note on the configuration settings I've adopted which improve performance considerably - more soon.

How long can this last?  Well, on the evidence of the last few days, I'm optimistic.

Wednesday, 3 October 2007

G.ho.st in the machine

G.ho.st in the machine : Blogs : BCS

Peter Murray blogs about healthcare informatics on the BCS site, but the post that I've linked to here is all about g.ho.st, an amazing piece of work all done in Flash (as far as I can see), offering a kind of VM accessible through the browser.  Thanks to Peter's post, I've signed up. I also managed to get my preferred user-name, 'roger', so I guess there aren't too many users yet.

The Flash applet does all the work of course, much like an X Window System display server does when you drive it over a network (I can remember actually doing this! It's a fundamental feature of X: how many people still exploit it?).

With g.ho.st you get 3GB(!) of space and they claim you can ftp from Windows Explorer straight to your online store.  I tried this but it didn't work for me: I tried twice to upload a couple of PDFs - Windows thought it had transferred them but they didn't show up in g.ho.st.

A bit of a toy, as it stands, but very impressive and worth keeping an eye on.

Tuesday, 2 October 2007

Facebook

Sorry, but I just don't like Facebook. Actually, you can extend that dislike to all social networking sites, though to my eternal shame I must admit to being in at least one (LinkedIn).

I think this clip very nicely sends-up the whole Facebook thing. Too much free time ...

Thursday, 27 September 2007

IONA Artix: Video Tech Brief (John Davies)

IONA Artix: Video Tech Brief

I've been looking at the Java and SOA landscape again (I expect to write more on this) and came across this interview with John Davies of Iona. I'm pointing to this video (link is above this text)  not because it's a particularly colourful performance (forgive me Mr. Davies) but because it's worth listening to what he has to say.

This is all about high-volume bank transactions which handle complex data structures (he talks mainly about Swift), exchanged and processed in XML messages. The architectures they are using for this are all based on Java ESB/SEDA platforms, on top of which Iona adds its Artix Data Services to handle metadata management and transformation services.

First, it's notable that they are building on established open-source projects - Iona has a very strong investment in open-source solutions - using Apache ActiveMQ as the basis of their FUSE enterprise messaging product.

That's all very interesting, in an industry-direction sort of a way, but it was something else (technical!) that really caught my eye (or ear): John Davies talked about the database bottleneck for these high-volume transactional systems: the messages being persisted are XML (hierarchical data) and they simply cannot accept the overhead of an ORM layer and mapping these to tables, so they're using a completely different approach, saving them as immutable BLOBs, indexing appropriately. New versions of the same message are simply stored as new objects, the original is not touched. Combine this with a massively parallel service layer and distributed store (he talks about running Gigaspaces on Azul - some 700-odd cores!) and you have a very interesting proposition.
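
As a toy illustration of the append-only idea (the class and method names here are my own invention, nothing to do with Iona's actual implementation): each new version of a message is stored as a fresh immutable blob, and earlier versions are never touched.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MessageStore {
    // messageId -> full version history, oldest first; blobs are never mutated
    private final Map<String, List<byte[]>> versions = new HashMap<String, List<byte[]>>();

    public void append(String messageId, byte[] blob) {
        List<byte[]> history = versions.get(messageId);
        if (history == null) {
            history = new ArrayList<byte[]>();
            versions.put(messageId, history);
        }
        history.add(blob.clone()); // defensive copy keeps stored blobs immutable
    }

    public byte[] latest(String messageId) {
        List<byte[]> history = versions.get(messageId);
        return history == null ? null : history.get(history.size() - 1).clone();
    }

    public static void main(String[] args) {
        MessageStore store = new MessageStore();
        store.append("MT103-001", "v1".getBytes());
        store.append("MT103-001", "v2".getBytes()); // a new object; v1 is untouched
        System.out.println(new String(store.latest("MT103-001"))); // prints "v2"
    }
}
```

Because nothing is ever updated in place, there's no locking to speak of on reads, which is what makes the massively parallel service layer on top plausible.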

It's in a follow-up comment on the ServerSide page that he expands a little on the problem of efficiently storing immutable, hierarchical objects: he points to Subversion as one good way to accomplish this and handle versioning, where performance isn't an issue. But he also mentions ZFS, the new filesystem Sun is developing as part of OpenSolaris - this offers a transactional, pooled-storage abstraction which is exactly what this sort of architecture needs.

This is fascinating stuff. Reading around this subject it's very clear that a lot of intellectual effort and investment has been poured into solving the problems of building and operating truly scalable ESB and SOA-based solutions and novel, high-performance persistence, and most (all?) of this amazing work has been done with Java, and is open-source.

As I worked my way through web pages and PDFs, I didn't find references to Microsoft's technologies - what, if anything, are they doing in this area?

Wednesday, 26 September 2007

Trying out Blogger

This is obligatory 'first post' to see that it works. I don't know whether I'll post regularly here or stick to using WordPress on my own site.

Monday, 30 July 2007

Amazon Marketplace

I've used Amazon (.co.uk) quite a bit. I've bought quite a range of stuff, from books and CDs right up to my latest digital camera. So far, I've been a very happy customer: prices have been keen, stuff has arrived quickly, been well packed and always exactly what I ordered.

So I'm feeling a bit bruised today, after discovering that the delivery charge for the two Compact Flash cards I've just ordered will be around 40% of the item price! About 9 pounds postage, on a twenty-two pound bill! These are small, light objects and they're being sent from inside the UK. Outrageous.

The Amazon webpage for the item is here.

Notice the headline price, and the fact that there is no indication of the delivery charge. Of course, I should have carefully read the subsequent pages before clicking on the 'confirm order' button, but I just didn't expect that I would need to check an Amazon order (even involving an Amazon Marketplace seller) for this sort of thing.

The marketplace seller is called _memorymegastore_ -- I'll wait to see how fast a delivery I get for my 9 pounds, then leave them some feedback. I've emailed Amazon too, but don't expect much...

Caveat emptor...

Wednesday, 25 July 2007

NetBeans and Eclipse (again)

Java technology is so frustrating: I'd like to get deeper into it mainly because of projects like Mirth, but the whole developer experience feels so excruciatingly awful when compared to Visual Studio and C#. I decided to give NetBeans another go because there's a new developer release (6.0) and I found the 5.5 release reasonably good. After waiting over two and a half minutes for the damn thing to start, I remembered just how ugly it was. Even with a bit of fiddling (make the icons smaller, change the editor font, etc. etc), it still looks and feels clunky, ugly and slow. Then I did the usual smoke test - create a new standard project, build it and run it, just to be sure all the bits are in the right place. Clunk, ... grind ... whirr ... splat. What's taking so long? This is a trivial project. Eventually, it did build and it did run. And it looked awful. The UI editor (frankly, one of the better Java UI editors out there) is still nowhere near as good as Visual Studio. NetBeans is free, it has a lot of features, it can create a lot of different project types, it has UML and BPEL built-in and it has Sun behind it, but it just doesn't encourage me to persevere with it.

What about Eclipse (and Europa)? Well, the initial install is quite quick but then you have to download the Europa packages. After about 30 minutes or so, I've got the whole enchilada (easier just to get everything rather than fret about the dependencies) and I can start the thing. The startup time is less than NetBeans but still rather slow compared to VS. Now to create the 'hello world' Swing or SWT application (I don't really care which toolkit). So, File, New Project. My, but there are a lot of project types. Go for Java (there is no finer-grained optionality here). Now, how do I add a dialog class? Can't see dialog class in File/New, so I choose File / New / Other. Gulp. There are 37 groups of project items (I counted them)! Thirty-seven GROUPS! This is not use-case driven; this is madness. So I choose the Java category, and only 'Class' seems appropriate.

My patience has run out (again). Both tools, for different reasons, turn me right off Java development.

Saturday, 30 June 2007

HL7 Standards and Message Mapping

For most of the last year I've been working on healthcare integration projects, involved in HL7 and related technologies. One of the biggest frustrations I've encountered is the HL7 standard itself, access to it (and related information) and the way the standard is published.

The HL7 (v2) standard was published as a body of Word documents, apparently without any machine-readable version. You might have expected that the standard would have originated as a database plus supporting commentary but it seems that the Word documents have always been the definitive standard.  You could of course write software to crawl over the documents and extract the standard but it looks to me as if this would be pretty painful; I haven't checked every document, but I'm not sure they are all consistently formatted.

Almost from the start I decided to write tools to help me with message profiles, mappings and transformations, but to do that effectively you really need to have the standard in a machine-readable form.
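
For a flavour of what such tooling starts from, here's a bare-bones sketch that splits an HL7 v2 message into segments and fields using the standard delimiters (carriage return between segments, '|' between fields); the sample message is invented for illustration.

```java
public class Hl7Split {
    public static void main(String[] args) {
        String msg = "MSH|^~\\&|SENDER|SITE|RECEIVER|SITE|200706300800||ADT^A01|123|P|2.4\r"
                   + "PID|1||12345^^^HOSP||SMITH^JOHN";
        // Segments are separated by carriage returns...
        for (String segment : msg.split("\r")) {
            // ...and fields by the pipe character (regex-escaped for split())
            String[] fields = segment.split("\\|");
            System.out.println(fields[0] + ": " + fields.length + " fields");
        }
    }
}
```

Of course a real tool also has to honour the component (^), repetition (~) and escape (\) characters declared in MSH-2, which is where a mere split() stops being enough and a proper machine-readable standard would earn its keep.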

I did come across a German site which offered an Access DB version of the standard but this has now become an HL7.org product (I believe it was originally unaffiliated but legal pressure was brought to bear...)

The fact that HL7.org is a quasi-commercial entity irritates me: I don't mind special-interest bodies charging for 'value-added' things like books, papers and conferences, but making the standard itself proprietary and restricting access to it just feels wrong.  Standards like this should be open to all.

What prompted this post was discovering this effort to capture HL7 segments, fields and tables in an Excel spreadsheet. Pity that the link to the file doesn't seem to work: Matthew, if you read this, please check the link in your post.

Thank goodness v3 is XML-based. Presumably the specification will be driven from XSDs.  Reading the HL7 'statement of principles' (SOP) for v3, they appear to regard the v2 approach of starting from documents and deriving the technical artifacts as 'more direct', since 'one simply edits ... the appropriate word processing document'! Direct, certainly, but desirable?  The SOP indicates that a tool-chain will be used to derive documentation from 'computerized models', which sounds better.

I was able to get access to the latest v3 standard draft, and part of that site contains the XSDs - but they're all linked separately, and as HTML! Why on Earth not provide a zip download of the whole thing?

Tuesday, 12 June 2007

Google Gears, Silverlight and the RIA

I use Google Docs and Google Reader all the time. With Google Gears, these are set to become even more useful because they'll work when I'm offline. The Reader already works this way (see this post for more), but not Docs + Spreadsheets yet. Rumour has it that this is coming. If they solve the sync problem properly, this will be seriously good: Docs is fine for note-taking and certainly good enough for blog posting. And of course they must extend Gears to GMail.

Web 2.0, Ajax + Gears (+ whatever server-side stuff you use to generate all this) may be good enough technology for building browser-applications which work offline, but underneath the covers it's still based on HTML, grinding out a ton of hard-to-debug JavaScript, and running inside the browser frame. For the poor developer, even though dedicated folk spend hours creating libraries like script.aculo.us (and Google's own GWT ) the experience of creating these applications remains pretty dire. And the result of all this extraordinary effort is still something which doesn't even approach the sophistication of the equivalent desktop application. As I said, I really like Google Docs, but it isn't Word, and won't ever be.

Then there's Adobe Flash and ActionScript, which have been around for a long time, are well established and have been used to create quite complex browser-hosted applications, such as Gliffy, which is (a bit) like Visio in a browser frame. Flash is cross-platform as well as cross-browser, but there's still quite a gap between developing for Flash and developing for the desktop. Now Adobe Labs is promoting AIR (the technology formerly known as Apollo). This looks like a warmed-over Flash, and still appears to be based on JS and HTML.

When I first heard about Silverlight, I wondered why Microsoft appeared to be tilting at the Flash/ActionScript windmill. Why would they expect developers of browser-hosted, graphical presentations to switch away from Flash? But that's not really what Silverlight is about: this is about creating a platform for Rich Internet Applications (RIAs) based on the .NET platform, which makes it potentially much more interesting, especially if development can be done in C# using Visual Studio, rather than by mashing together markup and script. The runtime appears to be a slightly reduced version of the CLR, supporting WPF/XAML as well as the traditional HTML/JS mixture. Quite remarkable, in a 1.1MB download. Rather than repeat more technical details here, I'll refer you to the Silverlight architecture summary paper.

Silverlight already runs on Mac as well as Windows, and in all the popular browsers. Could it be ported to Linux/Unix? The Mono project has proved the portability of the runtime, so it doesn't sound like such an outrageous suggestion.

Wednesday, 6 June 2007

Google Docs for Blog Posting

I'm finding Google Docs more and more useful for note-taking online, and for quickly bashing together an outline document. I've just discovered that it will also post to weblog engines: this post was created in Google Docs, then published straight to WordPress via the MovableType API. This appears to work quite well, except that the document title doesn't appear to make it across to the blog posting (Update: yes it does - keep reading). You can tag the document in Google, and these tags are supposed to carry over into the blog posting.

I'll use this posting to experiment a bit. First, if I insert a picture into the document, will that be uploaded and linked in the post? I'll insert the image directly below this line:


Well, that worked! The secret is to choose the metaWeblog API instead of the MovableType API, despite the fact that Google recommends the latter: WordPress supports both, but the metaWeblog API appears to be better supported. I tried using the MovableType API initially; the posting title wasn't carried over, and tags didn't work. Both work with metaWeblog.

This really is excellent, especially because of the image embedding. Now all I want is for Google to add support for Google Gears to Google Docs and Spreadsheets, and I'll be very happy indeed.



XMI from Assemblies

I was about to start writing a tool to do this when it occurred to me that someone else must have done this already. Googling located two solutions:

  1. DOTNET2UML from AgileFactor, see this page.

  2. Xmi4DotNet, see this page.


Of these, (1) seems to be out of date and doesn't handle .NET 2.0, so it's no good to me at all. Option (2) appeals the most because it's an addin for Reflector. Trouble is, the current version of Reflector won't load it! I haven't time to work out why, but looking in the Google code issue list for this project, someone has reported this error already. Would be nice to see this fixed. I'd quite like to write code to do the reflection + generation myself, but it's just a distraction I can't afford right now.

WordPress v2.2

Upgrading successfully from 2.0.2 to 2.2 was not as straightforward as it should have been.

I followed all the instructions, disabling all the plugins and setting the theme back to the original default, but when I ran the wp-admin/upgrade.php script I got a lot of database errors.  The blog worked, but the categories were missing and the admin pages which refer to them showed errors.

These seem to be due to database permissions issues.  Googling for a solution, I found a reference or two (e.g. this one), so I restored the old DB (what a good thing I did the backup!) and used phpMyAdmin to change the db-user permissions to include ALTER. That fixed it: the upgrade script then ran perfectly.

My main gripe is that despite the (tedious, manual) DB fix, most of my posts appeared to lose their category tags.  Eventually, these appeared again, but I'm not sure exactly why.

The upgrading process is irritating. WordPress is such a fine piece of work overall that it seems a shame upgrading should be so clumsy and buggy.

Sunday, 15 April 2007

Tate Modern

We had a great family day out yesterday at the wonderful Tate Modern gallery in London. It's the last week of Carsten Höller's slide installation - great fun, whether or not you think it's art! Take a look at the website because there are plenty of photos (even videos) of the slides, including a time-lapse of their construction. The tickets were free, but you had to queue for them. We managed to get quite a few tickets, so we were able to enjoy all of the slides. The highest (from level 5) looked pretty daunting from the turbine-hall, but was actually rather a gentle experience and you can't really see out of the slide properly so you're not aware of the starting height.

Tate Modern is quite simply superb. I don't like all the exhibits (for example, there's a big Gilbert and George thing happening there right now, but I'm not a fan) but I love the building, the spaces within, the facilities and the atmosphere. And it's free. Places like Tate Modern are so important: it matters that we spend public money this way. The whole time we were there, we were happy, and all the other people we saw seemed to be, too. I felt uplifted in some way, not just because I had enjoyed myself there, but because I had shared that enjoyment with so many other people.

Thursday, 22 February 2007

Jim Gray

I can't believe that I missed the news that Jim Gray had been reported as lost-at-sea.  This is just terrible.  Jim Gray is one of the greatest minds around.  Whenever I've heard Jim speak, or read one of his papers, I've been struck by the depth of his insight and the power of his quiet, engaging style.

The last time I saw and heard him in person was at PDC2003.  He talked about high-performance computing (and distributed copies of this paper), discussed the building of very large servers (e.g. for the SkyServer project), and said something which has stuck in my mind ever since.  He was talking about the problems of processing very large volumes of scientific data and building petabyte-scale computing facilities (later expressed in this paper).  He said (this is not a verbatim quote), "We've got to start thinking in terms of moving the program to the data, instead of the other way round, which is what we've been doing for years".

This is a profound idea, which is workable now that we have almost ubiquitous networking, grid-computing, and fast, cheap hardware (CPU and storage).  Network bandwidth, though improving all the time, is not increasing fast enough to keep track with the increase in data we need to analyse.

Maybe Jim is still out there on the Farallon Islands, getting away from it all.  I just wish he'd let us know he's OK.

Tuesday, 13 February 2007

Java Futures

Have just come across two useful posts on the future of Java.  Bruce Eckel's piece is mainly focussed on web applications.  I agree wholeheartedly with his opinion that the web is "a mess", and that the technologies which have sprung up to help make the mess manageable (such as Google's Web Toolkit) are merely a convenient veneer hiding the ugliness and limitations of the underlying platform (i.e. JavaScript, HTML and the browser), which are always revealed in the end. The failure of Java applets, and indeed of all the 'active content' technologies, to solve the Rich Internet Application (RIA) problem reflects the difficulty of solving the distribution, installation and security issues.

I also enjoyed his comments about how Java was 'rushed out' to fill the gap, then subject to extensive refactoring.  The analogy with agile methods is just about admissible, especially in connection with AWT and Applets, but Eckel also notes the vitality of the Java world, and the strong, positive effect of competition on the Java / C# landscape.

Much of the rest of his post is an enthusiastic account of how Flash effectively solves the RIA problem, albeit in a vendor-dependent way. We may not like de facto standards emerging from a single commercial source, but when the technology is good enough we will find ways to compromise our principles.

The other piece is an IBM DeveloperWorks article by Elliotte Harold.  This is much more about the Java platform itself, and what we can expect to see happening to it in 2007.  With the decision to open-source the JDK, Sun has created the possibility of forks in the Java roadmap, allowing experimenters to introduce language or platform features.  This is a mixed blessing: the Java world is already a complicated place to be, and this will make it even more so.  On the other hand, this will allow a lot of talented people outside Sun to work on things like language primitives for structured data types, such as tables (for SQL integration) and trees (infosets and XML). There is a perception that Java lags behind C# in this respect (e.g. LINQ in C#).

Open-sourcing Java is good news, up to a point, but I just wish Sun would work harder to pull together the language, platform and libraries story into a more consistent whole.  There are simply too many ways to do something: for example, how many web-service frameworks are there?  I can think of three or four without trying too hard.  All different.  (Axis 1 and 2, XFire, JAX-WS, XINS, ...)

There's no doubt that C# and .NET are a simpler, cleaner and more consistent place to be than the Java world (especially the 'enterprise' Java world), but then they damn well ought to be, because .NET only has to live on top of one OS, and isn't subject to any sort of community process to decide its future: only Redmond (and some Microsoft Research sites around the world, notably here in Cambridge) determine the trajectory of .NET and the C# language. (Don't buy the line that the whole shebang has been ported to lots of other platforms: take a look at any of them and show me where the enterprise features are. Yes, you can compile and run a C# program on Linux, but can you deploy and run a server-side C# component which depends on Enterprise Services, e.g. distributed transactions? No, you can't.)

Arguably, C# and .NET owe their very existence to Sun's Java initiative.  This remains a largely unacknowledged technical debt, often ignored by Microsoft's supporters. For many of the features we take for granted, Java got there first.  It really is a cross-platform proposition including on the server, albeit with some residual, ugly cruftiness which is being weeded out gradually.

Personally, I really like C# and the .NET platform, due to the internal consistency of the runtime and C#'s cleaner syntax. However, I really want to see Java develop and evolve, partly because I like it, but largely because we benefit in all the obvious ways from a competitive environment: if one or other platform put the other out of business, innovation and improvement would be subordinated to establishing and consolidating a monopoly position.

Friday, 26 January 2007

Formatting XML using Python

Update: This article contains formatting errors, resulting from the WordPress editor I think. In future, I'll write articles in the wiki and simply refer to them from in here. This article appears on this page in the wiki, with better code markup.

The current work project involves a fair bit of work with XML Schema and instance documents which are validated against these schemas. The instance documents are generated by one application, destined for consumption by another; the resulting XML is, unsurprisingly, not formatted for reading by humans. As usual with XML, you end up having to look at it occasionally in Notepad (or EditPlus, my favourite Windows workhorse editor), and then of course you want to see the document structure nicely indented (and, by the way, the newest EditPlus release will automatically create folds for you - very nice).

So, you locate or create a nice little script to pretty-print the XML. Perhaps the most obvious way to do this is using the identity transform, in XSLT. But as I had been writing little Python scripts to generate and transform XML on this project, I decided to write a tiny Python function to 'tidy' XML files. I've attached a complete script to this post, but the lines which actually do the work are:

from xml.dom import minidom

dom = minidom.parse(inFile)  # parse() accepts a filename directly
dom.writexml(open(outFile, "w"), addindent=" ", newl="\n")

The writexml method from the minidom package is where the 'pretty printing' is actually happening. If you run this script against an XML file, it will appear to work - the resulting XML is indented and formatted. However, there is a 'gotcha', which is the real point of this post.

If your XML is validated against a schema, and the schema contains an enumerated type, the nicely formatted instances of the enumerated type in the XML document are not schema-valid! Here's an example schema fragment:

<xs:simpleType name="MessageType">
  <xs:restriction base="xs:string">
    <xs:enumeration value="REF_INC"/>
    <xs:enumeration value="REF_TRI"/>
    <xs:enumeration value="REF_REJ"/>
    <xs:enumeration value="REF_ACC"/>
  </xs:restriction>
</xs:simpleType>

And here's a fragment of XML from a document instance:

<Header><MessageID>20070125152405435</MessageID><MessageType>REF_ACC</MessageType>
<MessageTypeVersion>0.5</MessageTypeVersion><Destinations>
<Destination>PRC</Destination></Destinations>
... etc.

Here's what writexml produced:

<Header>
<MessageID>
20070125152405435
</MessageID>
<MessageType>
REF_ACC
</MessageType>
<MessageTypeVersion>
0.5
</MessageTypeVersion>

Indented, and with newlines between elements. Unfortunately, newlines are also inserted into the text values. In this example, MessageType is no longer schema-valid: the whitespace is included in the value of this element. This is, of course, because the text values are sub-nodes of the MessageType element node. The documentation doesn't appear to offer much help, and experimenting with writexml arguments didn't result in anything better.

There's also a toprettyxml method:

prettydoc = dom.toprettyxml(indent=" ", newl="\n")
with open(outFile, "w") as fp:
    fp.write(prettydoc)

This is no better. In fact, I seem to end up with multiple, redundant newlines in the output. This is getting silly! All I want is nicely formatted XML, for goodness' sake! Have I missed something here? This is just not worth this much effort - the XSLT works fine.
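For reference, the XSLT mentioned above is just the standard identity transform with indented output - something like the following sketch (the exact behaviour of indent="yes" is processor-dependent, but mainstream processors leave text content alone):

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- Drop whitespace-only text nodes so stale formatting doesn't linger -->
  <xsl:strip-space elements="*"/>
  <!-- Identity rule: copy every attribute and node through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```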

Turns out I'm in good company - while Googling for enlightenment on the dom methods, I came across this post by Bruce Eckel. He went a good deal further than me and wrote what looks like a proper solution (though I admit I haven't checked).
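A whitespace-safe pretty-printer can be sketched in a few lines of minidom code: recurse over the tree, indent only elements whose children are all elements, and emit any element containing text on a single line, so no whitespace leaks into text values. The pretty function below is my own illustrative sketch (not Eckel's code), and it deliberately ignores comments and processing instructions:

```python
from xml.dom import minidom

def pretty(node, indent="  ", level=0):
    # Hypothetical helper: indent 'structural' elements only; any element
    # with text (or mixed) content is serialised on one line via toxml(),
    # so values like <MessageType>REF_ACC</MessageType> stay schema-valid.
    pad = indent * level
    if node.nodeType == node.DOCUMENT_NODE:
        return "\n".join(pretty(c, indent, level) for c in node.childNodes)
    kids = node.childNodes
    if kids and all(k.nodeType == k.ELEMENT_NODE for k in kids):
        attrs = "".join(' %s="%s"' % (n, v) for n, v in node.attributes.items())
        inner = "\n".join(pretty(k, indent, level + 1) for k in kids)
        return "%s<%s%s>\n%s\n%s</%s>" % (pad, node.tagName, attrs,
                                          inner, pad, node.tagName)
    return pad + node.toxml()  # leaf, text-only or mixed content

doc = minidom.parseString(
    "<Header><MessageID>20070125152405435</MessageID>"
    "<MessageType>REF_ACC</MessageType></Header>")
print(pretty(doc))
```

Run against the Header fragment above, this indents the child elements one per line but keeps each text value inside its tags, which is all I wanted in the first place.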

There are two lessons here: (1) although I have a fondness for Python, and I do use it for scripting tasks, many corners of the library are frustratingly badly implemented and/or documented (especially the latter), and (2) you really need to understand the XML model when you're working with XML Schema. I recommend reading Elliotte Rusty Harold's book, Effective XML: see Item 10 (White Space Matters).