Thinking differently about XML

Volume 11, Issue 55; 04 Aug 2008; last modified 08 Oct 2010

Having an XML server at my disposal is making me think about XML applications differently.

I've been writing XML applications for a long time. Arguably, since we spelled XML “es” “gee” “em” “el”. In those years, I'd grown to think of XML applications as being primarily operations on some principal document: a book, a web page, an Atom feed, what have you. (I'm not saying all XML applications are like this, I'm just saying this is how I tended to think about them.)

Individual documents were sometimes composed from several files (via entities or XInclude) and some applications operated on a small number of files, but there was always at least some logical sense in which there was “the main file” and its ancillary files.

When my application involved a potentially large number of files, I usually massaged them into a single file and used that as one of my small number of files. All of the many and varied sources of information used to present essays in this weblog, for example, are aggregated into a honking big RDF/XML document and that document is used as an ancillary resource when formatting the XML for each essay, the “main file”.
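
For concreteness, that habit looks roughly like the little sketch below. The file name, essay URI, and dc:title property are made up for illustration; the weblog's real metadata is messier.

    xquery version "1.0";
    (: The old habit: load one honking big ancillary document and grub around
       inside it. File name, essay URI, and property names are illustrative
       assumptions, not the weblog's actual metadata. :)
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    declare namespace dc  = "http://purl.org/dc/elements/1.1/";

    let $meta  := doc("essays-meta.rdf")                  (: the aggregated metadata :)
    let $essay := "http://example.org/essays/2008/08/04"  (: the essay being formatted :)
    return $meta//rdf:Description[@rdf:about = $essay]/dc:title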

One of the demos I constructed to learn more about Mark Logic Server was a “W3C Spec Explorer”.

I took all of the W3C specs and poured them into the server, then set out to write some XQuery code that would allow me to view the specifications by date, by working group, by editor, and through full-text search (or any combination of those options simultaneously).

My starting point for building the sort of faceted navigation that I had in mind was the RDF metadata that the W3C provides. (Not only was I interested in having better full-text search of the specs for myself, I was also interested in exploring RDF in the server.)
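
To give a flavor of where I was headed, the faceted queries I had in mind were shaped roughly like the sketch below. The dc:date and dc:title property names are shorthand assumptions about the W3C metadata rather than necessarily its actual vocabulary; cts:contains and cts:word-query are MarkLogic built-ins doing the full-text part.

    xquery version "1.0-ml";
    (: Filter spec metadata by year and by full-text terms; the other facets
       (working group, editor) follow the same pattern with more predicates. :)
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    declare namespace dc  = "http://purl.org/dc/elements/1.1/";

    let $year  := "2008"        (: facet: publication year :)
    let $terms := "pipeline"    (: facet: full-text search terms :)
    for $spec in //rdf:Description[starts-with(dc:date[1], $year)]
                                  [cts:contains(., cts:word-query($terms))]
    order by string($spec/dc:date[1]) descending
    return $spec/dc:title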

In the course of learning how best to build this application, I posted a question on the internal “discuss” list asking some fairly basic questions about how to efficiently search RDF's odd serialization. One of the replies that I got suggested (clearly, concisely, and patiently) that I was thinking about the problem from the wrong end. I had gone out and asked the server to give me a fairly big document and now I was trying to grub around inside it to find stuff. Instead, I should “push the constraints to the database”. Don't just ask the server (the database) for the file, ask it for the actual elements I care about.

This did two things: first, it made my searches instantaneous, or so nearly so as to make no difference. Second, it made me start to think very differently about XML applications.

The server's “universal index” over all the content in the database makes it practical (often blindingly fast) to ask questions over an enormous number of documents.

Want all the rdf:Description elements of type “REC”? Just ask for them: //rdf:Description[rdf:type/@rdf:resource="…#REC"]. That's not all of the descendants of some document; that's all of them anywhere in the database.
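
Spelled out as a complete (if minimal) query module, that's roughly the following; the “…#REC” placeholder stands in for the full rdf:type URI, which I've elided above as well.

    xquery version "1.0-ml";
    (: Ask the database for every rdf:Description of type REC, wherever it lives.
       "…#REC" is a placeholder for the full type URI, elided as in the prose. :)
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    //rdf:Description[rdf:type/@rdf:resource = "…#REC"]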

In fact, in my application, I broke the big RDF document up into a bunch of documents, one for each rdf:Description so that I could ask for /rdf:Description, rather than looking at all descendants anywhere. Had I needed to, I could have limited the search to a particular collection or used any of a number of other mechanisms for making it very targeted.
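
The split itself is simple enough. Here's a sketch of the idea, assuming the big metadata document lives at /w3c/tr.rdf and using MarkLogic's xdmp:document-insert; the URIs and naming scheme are illustrative, not what the application actually uses.

    xquery version "1.0-ml";
    (: One database document per rdf:Description, so that /rdf:Description
       matches exactly one description in each document. :)
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    for $d at $i in doc("/w3c/tr.rdf")//rdf:Description
    return xdmp:document-insert(concat("/w3c/descriptions/", $i, ".xml"), $d)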

Maybe I'm just discovering something that was obvious to all of you, but I'm now thinking of XML applications over not just a few files, but a whole database. My world is suddenly a lot bigger, which is very cool.

There's a cool footnote to this essay too (though I'm not actually choosing to make it a footnote, but never mind). The guy doing all the patient explaining was Dave Kellogg, our CEO. Bonus points for a paradigm-shifting answer.

That our CEO took the time to read and answer a mundane technical question from a newbie on an internal list set the stage for something that was really driven home to me a couple of weeks ago at my first ever “semi-annual kickoff meeting”: this company is full of excellent people. Every single one of them, as far as I can tell.

I'm having a ball.

Comments

Norm, is it just me, or have you explained the abstraction from a database to a more general XML usage?

Is this just a database thing, or is it applicable more generally? XSLT/XQuery-wise, //x is frowned upon as wasteful. Is the 'universal index' the key to this, making it more usable?

regards DaveP

—Posted by Dave Pawson on 05 Aug 2008 @ 08:03 UTC #

Because Mark Logic Server was built from the ground up to operate on XML, its indexing strategies are designed explicitly to work well with the full richness of mixed content. It's definitely the universal index that makes it possible for the server to provide immediate answers to a large number of query expressions (and different kinds of query expressions simultaneously) that would otherwise be slow to compute.

Of course, I don't mean to make it sound magical. It's possible to devise query expressions that ask questions that don't have answers in the index and those can't be answered instantly. Luckily, there are a whole bunch of dials and knobs on the server, so it's often possible to tailor the index to suit the queries you want to ask.

TANSTAAFL: the more things you index, the more space the database occupies. But compressed XML + indexes turns out not to be that much larger than the original XML.

—Posted by Norman Walsh on 05 Aug 2008 @ 11:54 UTC #

Norm,

Any chance to have access to this application?

MoZ

—Posted by MohamedZergaoui on 06 Aug 2008 @ 01:38 UTC #

Not just at the moment. But maybe later in the year after the next version of the server officially ships.

—Posted by Norman Walsh on 06 Aug 2008 @ 05:05 UTC #

Hi Norm, do you have a specified subset of the RDF/XML syntax that you use to store RDF in the XML database, to avoid the ambiguities and variations? And even if so, I suspect the XPath expressions to traverse RDF would still get hairy. You worked on RDF twig for this very reason. I wonder, does Mark Logic Server have a better way of handling RDF? I guess, optimally that would be SPARQL... 8-)

—Posted by Jacek Kopecky on 12 Aug 2008 @ 11:14 UTC #

Parsing RDF in XSLT 2.0 or XQuery isn't too bad. Having user-defined functions and being able to call them in XPath expressions makes most of RDFtwig unnecessary.

I didn't do anything fancy, I just flattened all resources to single, separate nodes: no striping.

If you look at that through an XML lens and write some helper functions to work your way through bnodes, it's mostly ok.
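
For instance, a helper along those lines might look like the sketch below. It assumes the flattened descriptions are stored one per document, as in the spec demo, and the "editor" property name in the example call is hypothetical; it isn't the application's actual code.

    xquery version "1.0-ml";
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    (: Follow a property element to the rdf:Description it points at:
       by rdf:nodeID for bnodes, by rdf:resource for named resources,
       or to an inline rdf:Description child if the value wasn't flattened. :)
    declare function local:deref($prop as element()?) as element(rdf:Description)*
    {
      if (empty($prop)) then ()
      else if ($prop/@rdf:nodeID)
      then /rdf:Description[@rdf:nodeID = $prop/@rdf:nodeID]
      else if ($prop/@rdf:resource)
      then /rdf:Description[@rdf:about = $prop/@rdf:resource]
      else $prop/rdf:Description
    };

    (: Example: chase a (hypothetical) "editor" property from the first description. :)
    local:deref((//rdf:Description)[1]/*[local-name() eq "editor"][1])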

—Posted by Norman Walsh on 12 Aug 2008 @ 07:17 UTC #

I guess XPath 2.0 is pretty powerful, especially with preprocessing like you propose. I once created such a preprocessing (RDFXSLT flattening) step in XSLT, but I didn't have XPath 2.0 for the next step to make it really nice...

Anyway, I wanted to mention that there's this little thing that you might be interested in: it's called XSparql and it's a fairly straightforward combination of XQuery and SPARQL, so one can in a single query combine data from RDF (using graph patterns) and from XML, and output either XML or RDF (using SPARQL CONSTRUCT). (The link leads to a tech report and an online prototype, I'm among the authors.)

With your serious experience in XML and RDF, do you think such a combined language makes sense? Do you think it would be useful (no commitment from your employer necessary or even implied) to extend MarkLogic server with RDF capabilities, and if so, would a language like XSPARQL be a good way to access it?

—Posted by Jacek Kopecky on 12 Aug 2008 @ 09:52 UTC #

Norm: I'm curious whether this all has led you to any new insights regarding XML Schema Languages and/or XML Query languages? Thanks.

—Posted by Noah Mendelsohn on 18 Aug 2008 @ 03:30 UTC #