An example, for better or worse, of automating website interaction with XProc.

What happened was, the DocBook wiki broke. I don't know how or why, but it fell over. The problem, whatever it was, left the wiki immutable and the underlying database in a state of questionable consistency.

Clearly a problem that had to be fixed. I set up a new wiki, running MoinMoin 1.9.2 instead of 1.3.4 [Upgrade much? -ed].

In theory, there's an upgrade path from 1.3.4 to 1.9.2, but I'm sufficiently unsure about the state of the current database that I'm loath to use it. The last thing I want to do is put the new wiki into some indeterminate state. Instead, I grabbed all the most recent pages from the old wiki, trimmed out a bunch of cruft, and cleaned up the markup a bit (the wiki markup seems to have changed over time).

What I really wanted to do was add all these pages to the new wiki. Easy enough to do with a browser for one or two pages, but several hundred pages was way more than my patience would tolerate.

A quick experiment with HTTP Scoop made it look pretty easy.
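
Roughly speaking, the browser's conversation with the wiki boils down to four requests. (This is a reconstruction from the pipeline below, not the actual capture, and "SomePage" stands in for whichever page is being created.)

  POST http://wiki.example.com/DocBookWikiWelcome    (submit the login form)
  GET  http://wiki.example.com/SomePage              (view the page)
  GET  http://wiki.example.com/SomePage?action=edit  (fetch the edit form)
  POST http://wiki.example.com/SomePage              (save the new text)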

If only I had a tool that could make HTTP requests and process the results…wait, wait, I have one of those!

XProc ought to be up to this job, yes? Yes! In fact, it was reasonably straightforward. Wanna see how it works? Of course you do. The following pipeline works in XML Calabash version 0.9.20 or later.

I decided to pass the wiki markup as an input and the page name as an option. From the option, I construct the value of the URI for the page.
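
In other words, an invocation looks something like this (the file and pipeline names are hypothetical, and the exact command line depends on how you run XML Calabash; inputs are bound with -i and options are passed as name=value pairs):

  calabash -i source=SomePage.data loadpage.xpl page=SomePage

where SomePage.data holds a c:data element wrapping the wiki markup for the page.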

  1<p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc" name="main"
  2                xmlns:cx="http://xmlcalabash.com/ns/extensions"
  3                xmlns:c="http://www.w3.org/ns/xproc-step"
  4                xmlns:html="http://www.w3.org/1999/xhtml">
  5  <p:input port="source"/>
  6  <p:output port="result"/>
  7  <p:option name="page" required="true"/>
  8
  9  <p:variable name="pageuri" select="concat('http://wiki.example.com/',$page)"/>

Next I have to log in:

  1  <p:www-form-urlencode match="/c:request/c:body/text()">
  2    <p:input port="source">
  3      <p:inline>
  4        <c:request method="POST"
  5                   href="http://wiki.example.com/DocBookWikiWelcome">
  6          <c:body content-type="application/x-www-form-urlencoded">@@HERE@@</c:body>
  7        </c:request>
  8      </p:inline>
  9    </p:input>
 10    <p:input port="parameters">
 11      <p:inline>
 12        <c:param-set>
 13          <c:param name="action" value="login"/>
 14          <c:param name="name" value="NormanWalsh"/>
 15          <c:param name="password" value="MYPASSWORD"/>
 16          <c:param name="login" value="Login"/>
 17        </c:param-set>
 18      </p:inline>
 19    </p:input>
 20  </p:www-form-urlencode>
 21
 22  <p:http-request cx:cookies="login" name="login"/>
 23
 24  <p:sink/>

I reverse-engineered the way the login form works. I URL-encode and pass my username, password, and other parameters to a <p:http-request> that POSTs them to the server.
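
The <p:www-form-urlencode> step encodes the c:param-set from its parameters port and substitutes the result for the matched @@HERE@@ text node, so the body of the request ends up as the familiar form string (give or take parameter order and escaping):

  action=login&name=NormanWalsh&password=MYPASSWORD&login=Login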

I don't care about the result, so I drop it on the floor with <p:sink>.

I do care about cookies, so I have to store those somewhere. XML Calabash has an extension that lets you manage cookies in named sets. This <p:http-request> saves any cookies that come back in the “login” set.

Next, we have to get the page we want to edit.

  1  <p:string-replace match="/c:request/@href" cx:depends-on="login">
  2    <p:input port="source">
  3      <p:inline>
  4        <c:request method="GET" href="@@HERE@@"/>
  5      </p:inline>
  6    </p:input>
  7    <p:with-option name="replace" select="concat('&quot;', $pageuri, '&quot;')"/>
  8  </p:string-replace>
  9
 10  <p:http-request cx:cookies="login" name="getpage"/>
 11
 12  <p:sink/>

I use the “login” cookies so that the wiki knows who I am. I also use the cx:depends-on attribute to tell the processor that this step depends on the preceding login step, even though there's no dependency in the flow graph. Without this explicit statement about dependency, the processor might attempt to get the page before performing the login step.

Once again, I don't care about the output, so I drop it on the floor. In theory, I have to parse the output and find the “edit” link. In practice, I know how to construct it without looking for it in the markup. I'm not even sure I have to do this step, but it's what a browser does and it was easy to do, so I left it in.

Now we want to get the page that includes the edit form:

  1  <p:string-replace match="/c:request/@href" cx:depends-on="getpage">
  2    <p:input port="source">
  3      <p:inline>
  4        <c:request method="GET" detailed="false" href="@@HERE@@"/>
  5      </p:inline>
  6    </p:input>
  7    <p:with-option name="replace" select="concat('&quot;', $pageuri, '?action=edit&quot;')"/>
  8  </p:string-replace>
  9
 10  <p:http-request cx:cookies="login" name="getpageedit"/>

Again, we use the login cookies. And this time we don't drop the output on the floor because we have to extract the hidden fields from the page in order for our subsequent POST to work.

  1  <p:unescape-markup namespace="http://www.w3.org/1999/xhtml"
  2                     content-type="text/html" name="unescape"/>
  3
  4  <p:for-each name="for-each">
  5    <p:iteration-source select="//html:input[@type='hidden']"/>
  6    <p:output port="result"/>
  7
  8    <p:string-replace match="c:param/@name">
  9      <p:input port="source">
 10        <p:inline><c:param name="name" value="value"/></p:inline>
 11      </p:input>
 12      <p:with-option name="replace" select="concat('&quot;',/*/@name,'&quot;')"/>
 13    </p:string-replace>
 14
 15    <p:string-replace match="c:param/@value">
 16      <p:with-option name="replace" select="concat('&quot;',/*/@value, '&quot;')">
 17        <p:pipe step="for-each" port="current"/>
 18      </p:with-option>
 19    </p:string-replace>
 20  </p:for-each>

To get the hidden fields, we unescape the markup. XML Calabash uses TagSoup for “text/html” pages, so we'll get well-formed XML.

The <p:for-each> loop selects each of the hidden input fields and transforms them into <c:param> elements. We'll need those later.

Next, we have to construct the <c:param> for the “savetext” parameter that contains our wiki markup. This one's a bit tricky.

  1  <p:string-replace name="savetext" match="/c:param/@value">
  2    <p:input port="source">
  3      <p:inline>
  4        <c:param name="savetext" value="@@HERE@@"/>
  5      </p:inline>
  6    </p:input>
  7    <p:with-option name="replace" select='concat("&apos;",replace(c:data,"&apos;","&apos;&apos;"),"&apos;")'>
  8      <p:pipe step="main" port="source"/>
  9    </p:with-option>
 10  </p:string-replace>

What the hell, I hear you ask, is up with that “replace” value?

Well, see, what's going to appear on the source input port of our pipeline is a <c:data> element that contains the wiki markup of the page. The replace option is evaluated as an XPath expression, so we have to “quote” the value. This is a common idiom in <p:string-replace>[1]. Except, in this case, the value may contain both double and single quotes, so we need to make sure that they don't result in an invalid XPath expression!

Imagine that this is our <c:data> element:

  1<c:data>"Hello 'world'"</c:data>

If we do the usual quoting trick, the resulting XPath expression will be:

  1'"Hello 'world'"'

and that's not a syntactically valid XPath string value. So we use replace() to double up the apostrophes. That gives us

  1'"Hello ''world''"'

which is what we want. That took me a minute or two, believe you me.
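
To close the loop on the example: once that expression is evaluated and substituted, the parameter carries the markup exactly as it arrived (your serializer may escape the quotes differently):

  <c:param name="savetext" value="&quot;Hello 'world'&quot;"/>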

Next we wrap all our <c:param> elements in a <c:param-set>, construct a <c:request> to hold them, and use <p:www-form-urlencode> to encode them.

  1  <p:wrap-sequence name="wrap" wrapper="c:param-set">
  2    <p:input port="source">
  3      <p:pipe step="for-each" port="result"/>
  4      <p:pipe step="savetext" port="result"/>
  5    </p:input>
  6  </p:wrap-sequence>
  7
  8  <p:string-replace match="/c:request/@href" cx:depends-on="wrap">
  9    <p:input port="source">
 10      <p:inline>
 11        <c:request method="POST" detailed="true" href="@@HERE@@">
 12          <c:body content-type="application/x-www-form-urlencoded">@@HERE@@</c:body>
 13        </c:request>
 14      </p:inline>
 15    </p:input>
 16    <p:with-option name="replace" select="concat('&quot;', $pageuri, '&quot;')"/>
 17  </p:string-replace>
 18
 19  <p:www-form-urlencode match="/c:request/c:body/text()">
 20    <p:input port="parameters">
 21      <p:pipe step="wrap" port="result"/>
 22    </p:input>
 23  </p:www-form-urlencode>

Send that off to the server and we're done!

  1  <p:http-request cx:cookies="login"/>
  2
  3  <p:delete match="/c:response/*"/>
  4
  5</p:declare-step>

I display the result, after deleting its contents, just to make sure that I got a 200 back.
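
Because the request was made with detailed="true", the step returns a c:response wrapper whose children are the headers and the body; after the p:delete, all that's left is the shell, something like:

  <c:response status="200"/>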

That little XProc script got all the pages loaded in just a couple of minutes. FTW!

If you're interested, the whole script is available.


[1] So common that I regret not providing some sort of syntactic shortcut for it. Oh, well, there's always version 1.1.

Comments:

That sounds really neat, Norm, and almost validation for the work you've put into XProc! You certainly sound pleased with it!

Posted by Dave Pawson on 08 Mar 2010 @ 08:04am UTC #

Hi Norm,

The first thing that jumped into my head was to write a bunch of throwaway PHP/curl scripts, but this post sparked my interest in picking up XProc instead. Thanks for sharing it.

Posted by Yining on 08 Mar 2010 @ 05:19pm UTC #

This may come up later in your blog, but I was curious whether you had an XSL, or how you managed to get DocBook XML files into a wiki format? We are trying to do that, and I want to see if someone has already done this heavy lifting.

Thanks,

Russ

Posted by Russ Urquhart on 12 Aug 2010 @ 04:03pm UTC #

No. It remains perennially on my todo list. The MoinMoin folks have done a Wiki-to-DocBook transformation, so that might be a good place to start.

Posted by Norman Walsh on 12 Aug 2010 @ 05:40pm UTC #
Comments on this essay are closed. Thank you, spammers.