White Space

Volume 7, Issue 37; 10 Mar 2004; last modified 08 Oct 2010

If you gather any nine XML experts together and ask them a simple question about white space, odds are good that you’ll get at least ten answers.

The term “white space” occurs 32 times in Extensible Markup Language (XML) 1.1 (including one place where it is accidentally spelled “whitespace” and excluding the occurrence in the table of contents). It occurs 29 times in Extensible Markup Language (XML) 1.0 (Third Edition), including, humorously, a different single “whitespace” typo.

It is defined by a single production (number 3, “S”) that is referenced 69 times in the XML 1.1 Recommendation and 71 times in the XML 1.0 Recommendation. It is also referenced in many other specifications.

I mention white space because at the XSL Formatting Objects Task Force meeting earlier this week, I was reminded once again that if you gather any nine XML experts together and ask them a simple question about white space, you’ll get ten answers. At least. (No, I lie. There are no simple questions about white space.)

Admittedly, this is doubly true in a working group where it seems any attempt to constrain the answer to a question to a fixed number of choices is taken by the other members of the group as a challenge to come up with additional logically possible choices.

In this particular case, I attempted to constrain the answer to the question “how should a literal carriage return, #x0D, be treated in a formatting object tree?” I thought there were three possibilities: treat it like a space, treat it like a line feed, #x0A, or discard it. I was wrong. There were at least five answers, two others being: treat it as a line feed with line-feed-treatment set to preserve, or treat it as an error (“halt and catch fire”).

Before we could answer this question, of course, we had to spend twenty minutes making sure we all understood the rules for numeric character references and attribute value normalization.

In fairness, it has always been this way. White space handling is even more subtle in SGML. XML has simplified the white space rules of SGML. Honest.

There is a distinctly humorous aspect to this situation, but the questions really are subtle and the answers really aren’t simple. Two things are going on simultaneously: first, the XML parser has to deal with white space in several contexts, and second, it is attempting to shield applications from some of this complexity. Applications have to answer questions about white space as well, and it’s important that they get them right. The most frustrating interoperability problem in XSLT 1.0 is the fact that not all implementations treat white space the same way by default (that’s a nice way of saying the Microsoft implementation does it differently than everyone else, which is in turn a nice way of saying that they get it wrong).

White Space Questions

There are three things that come into play when considering whitespace.

  1. Is it white space? #x20 (space) is obviously a space. #x0A (line feed) is usually considered a space, except where the line feed is significant. What about #x0D (carriage return)? Those are the easy ones. What about #x85 (next line), #x2028 (line separator), and #x2029 (paragraph separator)? What about #x2002 (en space) and friends?

  2. Is it significant? Sometimes white space matters, sometimes it doesn’t. Some applications, such as XSLT, provide additional mechanisms to control where it is considered significant.

  3. Is it normalized? Put two spaces between the tokens of an attribute value that takes an enumerated list of values, and one of them goes away. End the lines in your XML document with carriage return, line feed pairs and they’re automatically replaced with single line feeds. Put an “&#10;” in an entity reference and it’s preserved, unless that entity reference is used inside an attribute value, in which case it gets normalized to a space. Unless it occurs adjacent to other white space characters, in which case all but one of them often go away.
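The normalization cases in the third question can be observed directly with a parser. What follows is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (which is built on expat); the element name p, the attribute names a and t, the entity name nl, and the NMTOKENS declaration are all made up for illustration. Other parsers, and other ways of building an infoset, may well behave differently.

```python
import xml.etree.ElementTree as ET


def parse(doc: str) -> ET.Element:
    """Parse a document with the standard library's expat-based parser."""
    return ET.fromstring(doc)


# Line ends: a literal CR LF pair in the document becomes a single LF.
crlf_text = parse("<p>one\r\ntwo</p>").text

# A character reference is not a literal line end, so the CR survives.
cr_ref_text = parse("<p>one&#13;two</p>").text

# In an attribute value, a literal line feed is normalized to a space,
# but a line feed written as a character reference is preserved.
literal_lf_attr = parse('<p a="x\ny"/>').get("a")
char_ref_attr = parse('<p a="x&#10;y"/>').get("a")

# A line feed stored in an entity is preserved in content, but it is
# normalized to a space when the entity is referenced from an attribute.
entity_doc = '<!DOCTYPE p [<!ENTITY nl "&#10;">]><p a="x&nl;y">x&nl;y</p>'
entity_elem = parse(entity_doc)

# An attribute declared as a tokenized type (NMTOKENS here, a hypothetical
# declaration) has leading, trailing, and repeated spaces removed.
tokens_doc = (
    "<!DOCTYPE p [<!ATTLIST p t NMTOKENS #IMPLIED>]>"
    '<p t=" a   b "/>'
)
tokens_attr = parse(tokens_doc).get("t")

results = [
    ("CR LF in content", crlf_text),
    ("&#13; in content", cr_ref_text),
    ("literal LF in attribute", literal_lf_attr),
    ("&#10; in attribute", char_ref_attr),
    ("entity in content", entity_elem.text),
    ("entity in attribute", entity_elem.get("a")),
    ("NMTOKENS attribute", tokens_attr),
]
for label, value in results:
    # repr() makes the otherwise invisible characters visible.
    print(f"{label:26} {value!r}")
```

Printing with repr() is deliberate: white space bugs are hard to spot precisely because the characters in question do not show up on screen.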

The XML 1.0 and XML 1.1 Recommendations give clear (but slightly different) answers to these questions. Unfortunately, XML parsers aren’t the only way to construct infosets or data models for an application, so applications still need to consider their answers.
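To make the first question, and the "slightly different" answers, a little more concrete, here is a rough sketch in Python. It compares production 3 ("S"), which admits only four characters in both Recommendations, with the broader notion of white space that Python's str.isspace() uses, and then models the line-end translation each Recommendation describes for literal line ends. The function names and the CANDIDATES table are illustrative assumptions, not anything defined by the specifications, and the two normalize functions are simplified models rather than parser code.

```python
import re

# Production 3 ("S"): space, tab, carriage return, line feed.
XML_S = frozenset("\x20\x09\x0D\x0A")

# Characters raised in the question above (names are for display only).
CANDIDATES = {
    "#x20 space": "\x20",
    "#x0A line feed": "\x0A",
    "#x0D carriage return": "\x0D",
    "#x85 next line": "\x85",
    "#x2028 line separator": "\u2028",
    "#x2029 paragraph separator": "\u2029",
    "#x2002 en space": "\u2002",
}


def is_xml_space(ch: str) -> bool:
    """True if ch matches the S production."""
    return ch in XML_S


def normalize_line_ends_xml10(text: str) -> str:
    """XML 1.0: CR LF pairs and lone CRs become a single LF."""
    return re.sub("\r\n?", "\n", text)


def normalize_line_ends_xml11(text: str) -> str:
    """XML 1.1: also translate CR NEL, NEL (#x85), and #x2028 to LF."""
    # Two-character sequences are listed first so they match before lone CR.
    return re.sub("\r\n|\r\x85|\x85|\u2028|\r", "\n", text)


for name, ch in CANDIDATES.items():
    print(f"{name:28} S: {is_xml_space(ch)!s:5}  str.isspace: {ch.isspace()}")
```

Note that an application that tests for white space with a general-purpose library call will not agree with the S production about #x85, #x2028, #x2029, or #x2002, which is one way the "at least ten answers" arise.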

We decided to treat #x0D like #x0A.

Comments

I was giving XSLT training today, and what is funny is that I used almost the same "political" words when I described the MSXML differences in white space treatment.

And even more fun. Two weeks ago I fixed some code which sometimes pulled #x0D into FO output from DocBook stylesheets.

I think that XML without white space handling rules and namespaces would be too easy to teach and XML training business would fall.

—Posted by Jirka Kosek on 15 Mar 2004 @ 09:09 UTC #