Why Refactor DocBook?

Volume 6, Issue 38; 16 Jun 2003

More thoughts on refactoring.

Michael Smith asked me to clarify the motivations I have for wanting to refactor DocBook. I did so on the list, but I'm putting my ideas online here as well for consistency.

These are some further thoughts on why I think now is the time to refactor DocBook. Apologies, in advance, if some of these issues have already been discussed on the list recently. I haven't caught up yet. I wrote these thoughts while I was disconnected on the plane ride home.

  1. The single most compelling reason, the reason that I think would be sufficient if it were the only reason, is that DocBook has become brittle. It has grown, slowly and reasonably conservatively but continuously, for many years. Changes that were each individually small and well conceived form quite a tenuous pile when taken all together. Look at the number of class and mixture parameter entities we now have. Many are very similar but not the same. Can you tell from inspection why they aren't the same? Is the organizing principle that created them discernible? I don't think so. As the current maintainer, I'm aware that this is my fault to one degree or another.

    Whatever the cause, and irrespective of whether or not it was avoidable, we've reached the point where my software engineering experience suggests that continuing on a path of accumulating patches is not practical.

  2. DocBook was conceived, designed, and built within the limiting framework of SGML and then XML DTDs. In some ways it stands as a testament to just how much you could do with those technologies. But they are hardly modern.

    For a project as large and important (if one measures importance in terms of number of users or amount of legacy, at least) as DocBook, I think novelty for novelty's sake would be a very bad idea indeed. In fact, if all things were equal, I don't think it would be inappropriate for DocBook to lag behind the technology curve. It needs to be stable and reliable.

    But all things are not equal. I think we've passed a complexity threshold beyond which the parameter entity mechanisms available in DTDs are simply not up to the task of supporting further development. I am not suggesting, and have never intended to suggest, that DocBook shouldn't be available as a DTD for many years to come. I just don't think that the DTD should be the “source format”, the format upon which further development and customization is based.

  3. Engineering advances do not proceed smoothly and uniformly over time. Instead, they proceed in fits and starts, with watershed events spurring periods of rapid development. I think RELAX NG is a watershed event in markup languages.

    DocBook hasn't suddenly become unmanageable because we added one more tag. The development of DocBook has been straining the bounds of DTD development for some time. I have been thinking about how to make progress, about how to perform a refactoring (although I'm not sure I was consciously aware that that was what I was considering) for several years. The famous “PE reorganization” RFE has existed for at least five years. I've considered, and even prototyped, several possible approaches.

    RELAX NG changes the validation model just a little bit. It removes some restrictions and allows us to think about validation in a different way. Suddenly I see a clear path forward, a way to build a much simpler, more coherent, more easily customizable DocBook framework.

    Now, at the moment, I have only a vision, and a few sketchy prototypes. I don't have enough running code to be certain my ideas will work. But I feel pretty confident.

  4. Tools exist (thank you again, James) that will allow us to continue to support existing tools and applications even as we move forward. If moving to RELAX NG required us to turn our backs on every DTD-based XML tool that processes DocBook, the very idea of doing it would be very much D.O.A. (Dead On Arrival).

    My vision for the intermediate future is one where DocBook is maintained in RELAX NG and where customization layers (both extensions and subsets) are devised at the RELAX NG level. But DTDs are still provided by translating the RELAX NG grammars with Trang.

    It is likely that the DTDs will not validate precisely the same documents as the RELAX NG grammar. The extent to which there is variation will depend in part upon how we design DocBook, but I don't think perfect fidelity should be a goal.

    If perfect fidelity isn't possible, why bother? Because even a slightly less constrained schema can still be used to drive editing tools like Emacs and Epic. And it will allow all the existing DTD-based tools to continue to offer some level of validation. (They'll be able to find simple typos, for example, even if they can't enforce every constraint.)

  5. DocBook needs to be able to adapt to a changing world. I've already found several occasions, for example, in which it would have been convenient for DocBook to have been in a namespace. I can imagine scenarios where it would be almost necessary. No matter what you think about namespaces, I think they're here to stay. I don't see any long-term viability in an attitude of refusing to use them, at least judiciously.

  6. I think similar arguments can be made for the judicious use of simple data types, although I'm by no means certain of that. I can imagine, for example, that there might be value in validating that the content of the date element is, in fact, a date. And even more potential value in being able to sort dates and other simple values “correctly”.

  7. I think DocBook is a world leader in its class. I think there's an opportunity here to continue that leadership role and I think we should take that opportunity. We should reinvent DocBook for the modern markup world.

    I don't think anything I'm suggesting is radical. I don't propose that we invent something that's going to be maliciously (or capriciously) incompatible with the current needs or even the current markup of existing users.

    It's just time to refactor. I think that's a natural part of the life cycle of a software system that's in the middle of its productive lifespan.
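
As a small illustration of point 6 above, here is a minimal sketch of what "validating that a date is a date" and "sorting dates correctly" could mean in practice. It is written in Python rather than as a RELAX NG datatype, and it assumes dates are written in ISO 8601 form (YYYY-MM-DD), which DocBook does not currently require. The function names parse_iso_date and sort_dates are hypothetical, chosen only for this example.

```python
from datetime import date


def parse_iso_date(text):
    """Return a date if text is a valid ISO 8601 calendar date, else None.

    Assumption: dates are written as YYYY-MM-DD. Free-form content such
    as "16 Jun 2003" is treated as invalid.
    """
    try:
        return date.fromisoformat(text.strip())
    except ValueError:
        return None


def sort_dates(values):
    """Sort date strings chronologically, rejecting any that are not dates."""
    parsed = [(parse_iso_date(v), v) for v in values]
    invalid = [v for d, v in parsed if d is None]
    if invalid:
        raise ValueError(f"not valid dates: {invalid}")
    # Sorting on the parsed date gives chronological order.
    return [v for d, v in sorted(parsed)]
```

With this sketch, parse_iso_date("2003-02-30") returns None because February has no 30th day, which is the kind of error a plain-text date element cannot catch.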