ePUB specifications

Volume 13, Issue 21; 07 Jun 2010; last modified 08 Oct 2010

Playing with ePUB. In this episode, ePUB versions of W3C specifications. [Update 10 June 2010] Regenerated with stricter compliance to the ePUB rules; added a few more. Plus pretty covers!

[Update 10 June 2010] My initial attempts were pretty bad. The epubcheck tool had a field day. I believe I've fixed those problems (except for a couple that are markup problems in the originals). I've also incorporated some cool new covers contributed by Stephane Curzi. Thanks, Stephane!

Last Thursday or Friday, I got to thinking about ePUB. I have a few ePUB books that I got from O'Reilly and I wanted a way to view them on my desktop. I didn't like any of the clients I found, so over a couple of evenings and a weekend afternoon, I cooked up my own library management/reader application on top of MarkLogic Server. (I published a few screenshots if you want a sneak peek; more about that project later.)

As soon as I had a few books in there, I knew I wanted more content. Specifically, I wanted the W3C specifications that I refer to frequently. Not only is the ePUB navigation quite nice, but when I get the search features built, having the specs in there will rock!

I think the W3C's decision to have a single, normative HTML format for each specification, with machine checkable rules for the structure of that specification, is exceptionally valuable.

I know that all of the XML Query and XSLT specifications are generated from more or less the same schemas and stylesheets, so they have a very consistent style. With a little reverse engineering from my existing ePUB files and a bit of XProc and XSLT magic, I was able to generate ePUB versions of the XSLT/XQuery specifications. (That pipeline is the subject of another essay.)

Having all the XSLT/XQuery specs is nice, but not having XProc seemed kind of lame, so I tweaked things a bit to make the XProc spec work. Then I tried a few others. A remarkably large number of the specs I care about converted just fine:

I even managed a couple of OASIS specs, with some hand editing of the metadata.

There are some that are obviously missing, like the W3C XML Schema specs. I might try to improve my script to handle a few more cases, but there's a sense of diminishing returns. Specs written in hand-authored HTML, for example, are likely to be a good deal less uniform.

Anyway, if you're interested in ePUB and in the above web specifications, please give them a try and let me know what you think.

[Update 9 June 2010] Turns out the XML Schema specs converted just fine after I fixed a simple bug. And here's the XForms 1.1 spec too, since it went through effortlessly as well.

[Update 10 June 2010] A few more.


Thank you for sharing these ePubs.

I'd strongly suggest that you use the open-source epubcheck validator (release 1.0.5 or later) to try to ensure that the ePubs you're producing follow the machine-enforceable elements of the ePub specification: http://code.google.com/p/epubcheck/

From looking at the epubcheck results, it looks like you're falling into a few common traps:

* The mimetype MUST be the first file in the archive (you're not ensuring that) and MUST be STORed rather than compressed (you are doing that)

* ePub (OPF) content MUST be valid against either DTBook or XHTML 1.1 (Strip @name, @width, and @clear and you'll be close)

* The stylesheet for your cover.html uses a fully-qualified URL rather than a relative file reference
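
The first trap above (mimetype first in the archive, and STORed rather than compressed) can be sketched in Python with the standard zipfile module. This is only an illustration, not the pipeline used to build the files discussed here: the helper names write_epub and mimetype_ok, the files mapping, and the sample entries are all hypothetical.

```python
import io
import zipfile

# Media type string that an ePUB's "mimetype" entry contains.
MIMETYPE = "application/epub+zip"


def write_epub(target, files):
    """Write an ePUB container to `target` (a path or a binary file object).

    `files` maps archive paths to bytes. Hypothetical helper for illustration.
    """
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry is written first and explicitly STORed.
        zf.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for name, data in files.items():
            if name == "mimetype":
                continue  # already written above
            zf.writestr(name, data)  # falls back to the archive default (deflate)


def mimetype_ok(source):
    """Return True if the first archive entry is a STORed, correct mimetype."""
    with zipfile.ZipFile(source) as zf:
        infos = zf.infolist()
        if not infos:
            return False
        first = infos[0]
        return (
            first.filename == "mimetype"
            and first.compress_type == zipfile.ZIP_STORED
            and zf.read(first) == MIMETYPE.encode("ascii")
        )


if __name__ == "__main__":
    # Placeholder content only; a real ePUB needs a proper container.xml and OPF.
    buf = io.BytesIO()
    write_epub(buf, {"META-INF/container.xml": b"<container/>"})
    buf.seek(0)
    print(mimetype_ok(buf))
```

A check like mimetype_ok only covers this one rule; epubcheck remains the tool to run for everything else.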

—Posted by Keith Fahlgren on 09 Jun 2010 @ 01:36 UTC #

It's come to my attention that I'm not actually resolving all the external references correctly. I'll try to fix that.

—Posted by Norman Walsh on 09 Jun 2010 @ 04:07 UTC #

Thanks, Keith. I'll try to fix these errors as quickly as I can.

—Posted by Norman Walsh on 09 Jun 2010 @ 04:08 UTC #

Curious as to why the RelaxNG spec made the cut, but not DocBook v5.0 or DocBook Publishers specs?! ;-)

Nice work, BTW. Cool stuff!

—Posted by Scott Hudson on 10 Jun 2010 @ 07:44 UTC #

I'm not sure if this is the reason for Norm not converting the Publisher's spec, but when I pointed the tool to the spec on OASIS, I got only a cover.html file and no other content. I suspect there's some magic needed for OASIS specs, since I was able to generate (minus covers, since I'm not on a Mac) the W3C XProc spec with no problem.

Norm, how did you process the OASIS specs? Did it take some extra steps?

—Posted by Dick Hamilton on 15 Jun 2010 @ 09:23 UTC #

We've recently been making conversions of W3C standards to print (as 6"x9" paperback books), e.g. The XML 1.0 Standard (5th Edition) and The XPath 2.0 Standard.

Although the main part of the conversion works well (we used XHTML -> LaTeX with a custom Ruby-based converter) we found that quite a lot of hand-tweaking was needed to get high-quality final output. For example, rotating very wide tables and formal grammars in the XPath standard to landscape, and dealing with other line-breaking issues.

I don't know how well these parts of an EPUB file would render on current devices with smaller screen sizes, but with a larger display it would be less of an issue.

—Posted by Brian Gough on 13 Jul 2010 @ 12:01 UTC #

I also like to have W3C specifications in ePub format and indeed the consistent layout the W3C uses is an advantage. For the conversion I use the following process:

- Open the spec with Firefox and save it to the desktop

- Import the result in OpenOffice.org and export it to ODT

- Open the result again in OpenOffice.org

- Modify the styles "Heading 2" and "Heading 3" to give them a page-break before, which produces segments that are manageable by e-readers.

- Export to ePub with ODFToEPub.

Here are some examples:

XPath 2.0

XSLT 2.0

Here is also an example of an OASIS spec, which was produced directly from the ODT version:

ODF 1.1


—Posted by Werner Donné on 05 May 2011 @ 07:09 UTC #