Stupid conversion tricks

Volume 10, Issue 18; 03 Mar 2007; last modified 08 Oct 2010

It doesn't matter how many steps it takes as long as it's fun, right?

Start with a Word document containing names and addresses in a three-up, label-ready format.

Open that document with OpenOffice and save it as an OpenOffice Text document. Unzip that. Now you have XML.

Pretty-print content.xml. Peek at it. Fairly reasonable XML, in fact.
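For anyone who would rather script the unzip-and-pretty-print steps, here is a minimal sketch in Python. It relies only on the fact that an OpenOffice Text document is a zip archive containing content.xml; the function name and the labels.odt filename are made up for illustration, and this is not the tooling actually used above.

```python
import zipfile
from xml.dom import minidom


def pretty_content_xml(odt_path):
    """Return the content.xml of an OpenDocument file as indented text.

    odt_path may be a filename or a file-like object (hypothetical helper).
    """
    # An .odt file is an ordinary zip archive; content.xml holds the body text.
    with zipfile.ZipFile(odt_path) as archive:
        raw = archive.read("content.xml")
    # Re-serialise with two-space indentation so the structure is easy to read.
    return minidom.parseString(raw).toprettyxml(indent="  ")
```

Calling `print(pretty_content_xml("labels.odt"))` on a saved document (the filename is an assumption) gives something you can peek at in any editor.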

Craft a 25-line XSLT stylesheet to extract the names and addresses and store them in CSV format. Cheer for RegEx support in XSLT 2.0.
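The stylesheet itself isn't reproduced here. As a rough stand-in for the same extraction idea, the sketch below does it in Python rather than XSLT. It assumes, purely for illustration, that each label sits in its own table cell with one paragraph per address line; the real document may well be laid out differently, and the regex cleanup is left out. The name `labels_to_csv` is hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Standard OpenDocument namespace URIs for tables and text.
NS = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def labels_to_csv(content_xml):
    """Turn each non-empty table cell into one CSV row (assumed layout)."""
    root = ET.fromstring(content_xml)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for cell in root.iter("{%s}table-cell" % NS["table"]):
        lines = []
        # One text:p per address line is an assumption about the label layout.
        for para in cell.findall("text:p", NS):
            text = "".join(para.itertext()).strip()
            if text:
                lines.append(text)
        if lines:  # skip empty cells, such as unused labels
            writer.writerow(lines)
    return out.getvalue()
```

Rows come out with as many fields as the label had lines, so irregular entries still show up and can be fixed by hand afterwards.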

Fix a few irregularities in the XML with Emacs. Iterate until done.

Open the CSV file with OpenOffice and save it as an Excel spreadsheet.

Now you have a “database” of names and addresses. I'm not quite sure why that was the desired format, but it only took about ten minutes and didn't require rekeying any data.

I call that a win.

Comments

It doesn't matter how many steps it takes as long as it's fun, right? Right.

It's funny you mention this. I build my gliding club's accounts system's manual as multiple HTML files from a single custom XML file (plus a "metadata" XML file output by the accounts program itself) via a couple of XSLT transforms. This wet Sunday afternoon's job is converting the Windows-specific makefile that controls this process into a somewhat more portable Ant script.

The manual didn't start as XML; in fact, it predates XML, as it was originally written in Microsoft Works 2.0 in 1994. To get it into a more tractable form, the quickest route I could find was to open the original chapter documents in the club's old copy of Works 2000 and save them in that program's format, open the results in Word at home and save them in its format, then use OpenOffice to convert those to OOo format and, as you did, extract the content.xml files. (I don't have a recent version of Word, but even if I had, I might well still have gone via OOo to XML.)

I then used a sequence of XSLT transforms (five of them, IIRC) to convert into a nice format, e.g., recognising section and subsection headers from the indent level and font. The result needed a bit of hand cleanup but not too much.

This was much more than 10 minutes' work but a lot more fun than either sticking with Works 2.0 or rekeying. In particular, these were the heaviest XSLT transforms I'd written to date, so I learnt quite a bit in the process. These sorts of ad hoc conversion chains do have a lot to be said for them, and if you do them only once it doesn't really matter how many steps are involved.

It doesn't matter how many steps it takes as long as it's fun, right? It is twice as much fun if it is reusable.

I used a similar technique to extract data item names from requirements documents written in MS Word to correlate with a business glossary maintained in Excel. I managed to get the authors of the specifications to clean up their Word documents so that the only manual editing was performed in the master documents (Word and Excel documents).

The merged information was used to publish the glossary for a data warehousing implementation.

BTW, a large number of business users I have encountered are comfortable with using Excel as a data entry tool, and I have started to build simple "MS Excel 2003 friendly" XML schemas to allow such data to be easily extracted and transformed into more structured forms (e.g., for building Schematron assertions from master reference lists).

I wonder what others define as 'fun'?

One definition I subscribe to:

The ability to get good XML out of a word|excel document?

In my case it was a spreadsheet generated by a manager on the move, but I counted it as fun!

Same route though. ODF is fun.

regards

Reminds me of how I extracted a schema for the XProc language from the draft spec (November 17th version in XML format): XSLT to RELAX NG Compact syntax, then a few manual touches in jEdit, and voilà. With learning RNC along the way, not a bad experience at all :-)

Pavol, I promise we'll make the schemas more accessible next time around :-)