From the Technical Plenary, a URI that got lost: a quick “off-the-cuff” definition for XML chunk equality based on the Infoset.

At the W3C Technical Plenary in March, 2004, the XML Core Working Group and the TAG met to discuss the “XML chunk” issue.

Part of that discussion was about what it means for two chunks of XML to be equal. I banged out a quick “off-the-cuff” definition of equality based on the Infoset.

I wanted to make the definition available during the meeting so I dropped it into my “scratch space” on this site. That URI must have made it into some record of the meeting because it turns up in my logs occasionally. In the spirit of keeping URIs persistent, here is the definition that I proposed:

1. Document Information Items

Two document information items are equal if their [children]
properties are equal, ignoring processing instructions and comments.

2. Element Information Items

Two element information items are equal if the following properties
are equal:

  - [namespace name]
  - [local name]
  - [children]
  - [attributes]

Children are compared in order, attributes without respect to order.

3. Attribute Information Items

Two attribute information items are equal if the following properties
are equal:

  - [namespace name]
  - [local name]
  - [normalized value]

4. Character Information Items

Two character information items are equal if the following properties
are equal:

  - [character code]

5. Unparsed Entity Information Items

Two unparsed entity information items are equal if the following
properties are equal:

  - [name]
  - [system identifier]
  - [notation name]

It’s not a complete definition (there are a few more information items that would have to be considered); I wrote it only to show that a definition based on the Infoset could be written. If that seems like a self-evident statement, well, all I can say is that it is sometimes useful at working group meetings to say explicitly things that are self-evident.
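For readers who want to experiment, here is a minimal sketch of rules 1 through 4 in Python, using the standard library's xml.etree.ElementTree. It is an illustration under stated assumptions, not part of the original proposal. The function names are hypothetical. ElementTree is not an Infoset API: its default parser drops comments and processing instructions everywhere (the definition above only says to ignore them at the document level), it folds [namespace name] and [local name] into a single tag string, and it does not expose unparsed entities, so rule 5 is left out.

```python
import xml.etree.ElementTree as ET


def attributes_equal(a, b):
    # Rule 3, compared without respect to order (rule 2).
    # ElementTree keys each attribute as "{namespace name}local name",
    # so dict equality checks namespace name, local name, and value together.
    return a.attrib == b.attrib


def elements_equal(a, b):
    # Rule 2: [namespace name] and [local name] are both carried by .tag.
    if a.tag != b.tag:
        return False
    if not attributes_equal(a, b):
        return False
    # Rule 4: character children compare equal when their codes match,
    # which string equality checks. Missing text is treated as empty.
    if (a.text or "") != (b.text or ""):
        return False
    # Children are compared in order.
    if len(a) != len(b):
        return False
    for child_a, child_b in zip(a, b):
        if not elements_equal(child_a, child_b):
            return False
        # Characters that follow each child element, up to the next child.
        if (child_a.tail or "") != (child_b.tail or ""):
            return False
    return True


def documents_equal(xml_a, xml_b):
    # Rule 1: the default parser has already discarded comments and
    # processing instructions, so only the document elements are compared.
    # Assumption: the document type declaration is not considered.
    return elements_equal(ET.fromstring(xml_a), ET.fromstring(xml_b))
```

With this sketch, documents_equal('<a x="1" y="2"/>', '<a y="2" x="1"/>') is true because attribute order is ignored, while documents_equal('<a><b/><c/></a>', '<a><c/><b/></a>') is false because child order matters. Namespace prefixes play no part: only the namespace name and the local name are compared.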

Comments:

Check out the SQL-2003 part 14 where an infoset-based sameness for XML datatypes is being defined. However, your example shows why equality is a controversial topic. I would think that for many applications, PIs and comments need to be considered for equality, but not for others...

Posted by Michael Rys on 26 May 2004 @ 01:20am UTC #