ePUB specifications

Volume 13, Issue 21; 07 Jun 2010; last modified 08 Oct 2010

Playing with ePUB. In this episode, ePUB versions of W3C specifications. [Update 10 June 2010] Regenerated with stricter compliance to the ePUB rules; added a few more. Plus pretty covers!

[Update 10 June 2010] My initial attempts were pretty bad. The epubcheck tool had a field day. I believe I've fixed those problems (except for a couple that are markup problems in the originals). I've also incorporated some cool new covers contributed by Stephane Curzi. Thanks, Stephane!

Last Thursday or Friday, I got to thinking about ePUB. I have a few ePUB books that I got from O'Reilly and I wanted a way to view them on my desktop. I didn't like any of the clients I found, so over a couple of evenings and a weekend afternoon, I cooked up my own library management/reader application on top of MarkLogic Server. (I published a few screenshots if you want a sneak peek; more about that project later.)

As soon as I had a few books in there, I knew I wanted more content. Specifically, I wanted the W3C specifications that I refer to frequently. Not only is the ePUB navigation quite nice, but when I get the search features built, having the specs in there will rock!

I think the W3C's decision to have a single, normative HTML format for each specification, with machine-checkable rules for the structure of that specification, is exceptionally valuable.

I know that all of the XML Query and XSLT specifications are generated from more or less the same schemas and stylesheets, so they have a very consistent style. With a little reverse engineering from my existing ePUB files and a bit of XProc and XSLT magic, I was able to generate ePUB versions of the XSLT/XQuery specifications. (That pipeline is the subject of another essay.)

Having all the XSLT/XQuery specs is nice, but not having XProc seemed kind of lame, so I tweaked things a bit to make the XProc spec work. Then I tried a few others. A remarkably large number of the specs I care about converted just fine:

XInclude
XLink 1.1
XML Namespaces
XML Namespaces 1.1
XML
XML 1.1
XML Base
XPath/XQuery Data Model
XPath/XQuery Full Text
XPath/XQuery Functions
XPath 2.0
XProc, or alternatively, chunked on third-level sections.
XPath/XQuery Formal Semantics
XQuery
XQueryX
XSL FO 1.1
XSLT/XQuery Serialization
XSLT 2.0
XSLT 2.1

I even managed a couple of OASIS specs as well with some hand editing of the metadata.

There are some that are obviously missing, like the W3C XML Schema specs. I might try to improve my script to handle a few more cases, but there's a sense of diminishing returns. Specs written in hand-authored HTML, for example, are likely to be a good deal less uniform.

Anyway, if you're interested in ePUB and in the above web specifications, please give them a try and let me know what you think.

[Update 9 June 2010] Turns out the XML Schema specs converted just fine after I fixed a simple bug. And here's the XForms 1.1 spec too, since it went through effortlessly as well.

[Update 10 June 2010] A few more.

RELAX NG Compact Syntax
RELAX NG DTD Compatibility
Guidelines for using W3C XML Schema Datatypes with RELAX NG
XML Schema Part 1, chunked on third-level sections
XML Schema Part 2, chunked on third-level sections

Comments

Thank you for sharing these ePubs.

I'd strongly suggest that you use the open-source epubcheck validator (release 1.0.5 or later) to try to ensure that the ePubs you're producing follow the machine-enforceable elements of the ePub specification: http://code.google.com/p/epubcheck/

From looking at the epubcheck results, it looks like you're falling into a few common traps:

* The mimetype MUST be the first file in the archive (you're not ensuring that) and MUST be STORed rather than compressed (you are doing that)

* ePub (OPF) content MUST be valid against either DTBook or XHTML 1.1 (Strip @name, @width, and @clear and you'll be close)

* The stylesheet for your cover.html uses a fully-qualified URL rather than a relative file reference
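The first trap in the list above (mimetype first and stored, not compressed) is easy to get wrong with general-purpose zip tools. What follows is only a minimal sketch using Python's standard zipfile module, not the pipeline actually used for these files. The function names write_epub and mimetype_problems are hypothetical, and the check covers only this one rule, so it is no substitute for running epubcheck.

```python
import zipfile

EPUB_MIMETYPE = "application/epub+zip"


def write_epub(path, files):
    """Write an ePUB container to `path` (a filename or file-like object).

    `files` maps archive names (for example "OEBPS/content.opf") to bytes.
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry is written first and explicitly STORED,
        # overriding the archive-wide deflate default.
        zf.writestr("mimetype", EPUB_MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for name, data in files.items():
            if name == "mimetype":
                continue  # already written above
            zf.writestr(name, data)


def mimetype_problems(path):
    """Return a list of problems with the mimetype entry (empty if none)."""
    problems = []
    with zipfile.ZipFile(path) as zf:
        entries = zf.infolist()  # entries in archive order
        if not entries or entries[0].filename != "mimetype":
            problems.append("mimetype is not the first entry")
            return problems
        first = entries[0]
        if first.compress_type != zipfile.ZIP_STORED:
            problems.append("mimetype is compressed")
        if zf.read(first) != EPUB_MIMETYPE.encode("ascii"):
            problems.append("mimetype has unexpected content")
    return problems
```

Calling mimetype_problems on an existing .epub file gives a quick sanity check of this one rule before handing the file to epubcheck for the full set.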

It's come to my attention that I'm not actually resolving all the external references correctly. I'll try to fix that.

Thanks, Keith. I'll try to fix these errors as quickly as I can.

Curious as to why the RelaxNG spec made the cut, but not DocBook v5.0 or DocBook Publishers specs?! ;-)

Nice work, BTW. Cool stuff!

I'm not sure if this is the reason for Norm not converting the Publisher's spec, but when I pointed the tool to the spec on OASIS, I got only a cover.html file and no other content. I suspect there's some magic needed for OASIS specs, since I was able to generate (minus covers, since I'm not on a Mac) the W3C XProc spec with no problem.

Norm, how did you process the OASIS specs? Did it take some extra steps?

The XML 1.0 Standard (5th Edition)
The XPath 2.0 Standard

Although the main part of the conversion works well (we used XHTML -> LaTeX with a custom Ruby-based converter), we found that quite a lot of hand-tweaking was needed to get high-quality final output. For example, we had to rotate very wide tables and formal grammars in the XPath standard to landscape and deal with other line-breaking issues.

I don't know how well these parts of an EPUB file would render on current devices with smaller screen sizes, but with a larger display it would be less of an issue.

I also like to have W3C specifications in ePub format and indeed the consistent layout the W3C uses is an advantage. For the conversion I use the following process:

- Open the spec with Firefox and save it to the desktop

- Import the result in OpenOffice.org and export it to ODT

- Open the result again in OpenOffice.org

- Modify the styles "Heading 2" and "Heading 3" to give them a page-break before, which produces segments that are manageable by e-readers.

- Export to ePub with ODFToEPub.

Here are some examples:

XPath 2.0

XSLT 2.0

Here is also an example of an OASIS spec, which was produced directly from the ODT version:

ODF 1.1

Werner.