Wiki editing with XProc
An example, for better or worse, of automating website interaction with XProc.
What happened was, the DocBook wiki broke. I don't know how or why, but it fell over. The problem, whatever it was, left the wiki immutable and the underlying database in a state of questionable consistency.
Clearly a problem that had to be fixed. I set up a new wiki, running MoinMoin 1.9.2 instead of 1.3.4 [Upgrade much? -ed].
In theory, there's an upgrade path from 1.3.4 to 1.9.2, but I'm sufficiently unsure about the state of the current database that I'm loath to use it. The last thing I want to do is put the new wiki into some indeterminate state. Instead, I grabbed all the most recent pages from the old wiki, trimmed out a bunch of cruft, and cleaned up the markup a bit (the wiki markup seems to have changed over time).
What I really wanted to do was add all these pages to the new wiki. Easy enough to do with a browser for one or two pages, but several hundred pages was way more than my patience would tolerate.
A quick experiment with HTTP Scoop made it look pretty easy:
- Logging in sets a cookie.
- Loading a page that doesn't exist provides a link that you can follow to create the page.
- Following that link returns an HTML page containing a form with a place to type the wiki markup and a bunch of hidden fields.
- Posting that form back to the server updates the page.
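Sketched in the c:request vocabulary that XProc uses for HTTP (you'll see it throughout the pipeline below), the conversation amounts to this. The parameter values here are placeholders:
<!-- 1. Log in; the response sets a session cookie. -->
<c:request method="POST" href="http://wiki.example.com/DocBookWikiWelcome">
  <c:body content-type="application/x-www-form-urlencoded"
    >action=login&amp;name=...&amp;password=...&amp;login=Login</c:body>
</c:request>
<!-- 2, 3. GET the page, then GET page?action=edit (cookie attached);
     back comes the edit form with its hidden fields. -->
<!-- 4. POST the hidden fields, plus "savetext" holding the new markup. -->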
If only I had a tool that could make HTTP requests and process the results…wait, wait, I have one of those!
XProc ought to be up to this job, yes? Yes! In fact, it was reasonably straightforward. Wanna see how it works? Of course you do. The following pipeline works in XML Calabash version 0.9.20 or later.
I decided to pass the wiki markup as an input and the page name as an option. From the option, I construct the URI for the page.
<p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc" name="main"
xmlns:cx="http://xmlcalabash.com/ns/extensions"
xmlns:c="http://www.w3.org/ns/xproc-step"
xmlns:html="http://www.w3.org/1999/xhtml">
<p:input port="source"/>
<p:output port="result"/>
<p:option name="page" required="true"/>
<p:variable name="pageuri" select="concat('http://wiki.example.com/',$page)"/>
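So a run looks roughly like this, assuming the wiki markup arrives wrapped in a c:data document (the file and page names are made up, and the exact Calabash command line varies a bit from version to version):
<c:data xmlns:c="http://www.w3.org/ns/xproc-step">= My Page =
Some '''wiki''' markup.</c:data>
java -jar calabash.jar -i source=somepage.xml wikipost.xpl page=SomePageName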
Next I have to login:
<p:www-form-urlencode match="/c:request/c:body/text()">
<p:input port="source">
<p:inline>
<c:request method="POST"
href="http://wiki.example.com/DocBookWikiWelcome">
<c:body content-type="application/x-www-form-urlencoded">@@HERE@@</c:body>
</c:request>
</p:inline>
</p:input>
<p:input port="parameters">
<p:inline>
<c:param-set>
<c:param name="action" value="login"/>
<c:param name="name" value="NormanWalsh"/>
<c:param name="password" value="MYPASSWORD"/>
<c:param name="login" value="Login"/>
</c:param-set>
</p:inline>
</p:input>
</p:www-form-urlencode>
<p:http-request cx:cookies="login" name="login"/>
<p:sink/>
I reverse engineered the way the login form works. I URL encode and pass my username, password, and other parameters to a p:http-request that POSTs them to the server. I don't care about the result, so I drop it on the floor with p:sink. I do care about cookies, though, so I have to store those somewhere. XML Calabash has an extension that lets you manage cookies in named sets; this p:http-request saves any cookies that come back in the “login” set.
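For what it's worth, after the urlencode step the request flowing into p:http-request looks more or less like this (password placeholder and all):
<c:request method="POST" href="http://wiki.example.com/DocBookWikiWelcome">
  <c:body content-type="application/x-www-form-urlencoded"
    >action=login&amp;name=NormanWalsh&amp;password=MYPASSWORD&amp;login=Login</c:body>
</c:request>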
Next, we have to get the page we want to edit.
<p:string-replace match="/c:request/@href" cx:depends-on="login">
<p:input port="source">
<p:inline>
<c:request method="GET" href="@@HERE@@"/>
</p:inline>
</p:input>
<p:with-option name="replace" select="concat('&quot;', $pageuri, '&quot;')"/>
</p:string-replace>
<p:http-request cx:cookies="login" name="getpage"/>
<p:sink/>
I use the “login” cookies so that the wiki knows who I am. I also use the cx:depends-on attribute to tell the processor that this step depends on the preceding login step, even though there's no dependency in the flow graph. Without this explicit statement of dependency, the processor might attempt to get the page before performing the login step.
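The @@HERE@@-plus-p:string-replace business is just a way of injecting the computed URI into a request template; what actually reaches p:http-request is (for a hypothetical page name):
<c:request method="GET" href="http://wiki.example.com/SomePageName"/>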
Once again, I don't care about the output so I drop it on the floor. In theory, I have to parse the output and find the “edit” link. In practice, I know how to create it without looking for it in the markup. I'm not even sure I have to do this step, but it is what a browser does and it was easy to do so I left it in.
Now we want to get the page that includes the edit form:
<p:string-replace match="/c:request/@href" cx:depends-on="getpage">
<p:input port="source">
<p:inline>
<c:request method="GET" detailed="false" href="@@HERE@@"/>
</p:inline>
</p:input>
<p:with-option name="replace" select="concat('&quot;', $pageuri, '?action=edit&quot;')"/>
</p:string-replace>
<p:http-request cx:cookies="login" name="getpageedit"/>
Again, we use the login cookies. And this time we don't drop the output on the floor because we have to extract the hidden fields from the page in order for our subsequent POST to work.
<p:unescape-markup namespace="http://www.w3.org/1999/xhtml"
content-type="text/html" name="unescape"/>
<p:for-each name="for-each">
<p:iteration-source select="//html:input[@type='hidden']"/>
<p:output port="result"/>
<p:string-replace match="c:param/@name">
<p:input port="source">
<p:inline><c:param name="name" value="value"/></p:inline>
</p:input>
<p:with-option name="replace" select="concat('&quot;', /*/@name, '&quot;')"/>
</p:string-replace>
<p:string-replace match="c:param/@value">
<p:with-option name="replace" select="concat('&quot;', /*/@value, '&quot;')">
<p:pipe step="for-each" port="current"/>
</p:with-option>
</p:string-replace>
</p:for-each>
To get the hidden fields, we unescape the markup. XML Calabash uses TagSoup for “text/html” pages, so we'll get well-formed XML. The p:for-each loop selects each of the hidden input fields and transforms them into c:param elements. We'll need those later.
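In other words, a hidden field in the edit form like this one (it's in the XHTML namespace after TagSoup; the name and value here are made up):
<input type="hidden" name="ticket" value="1234567890.12"/>
comes out of the loop as:
<c:param name="ticket" value="1234567890.12"/>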
Next, we have to construct the c:param for the “savetext” parameter that contains our wiki markup. This one's a bit tricky.
<p:string-replace name="savetext" match="/c:param/@value">
<p:input port="source">
<p:inline>
<c:param name="savetext" value="@@HERE@@"/>
</p:inline>
</p:input>
<p:with-option name="replace" select='concat("&apos;", replace(c:data, "&apos;", "&apos;&apos;"), "&apos;")'>
<p:pipe step="main" port="source"/>
</p:with-option>
</p:string-replace>
What the hell, I hear you ask, is up with that “replace” value? Well, see, what's going to appear on the source input port of our pipeline is a c:data element that contains the wiki markup of the page. The replace option is interpolated as an XPath expression, so we have to “quote” the value. That's a common idiom with p:string-replace, so common that I regret not providing some sort of syntactic shortcut for it. (Oh, well, there's always version 1.1.) Except, in this case, the value may contain both double and single quotes, so we need to make sure that they don't result in an invalid XPath expression!
Imagine that this is our c:data element:
<c:data>"Hello 'world'"</c:data>
If we do the usual quoting trick, the resulting XPath expression will be:
'"Hello 'world'"'
and that's not a syntactically valid XPath string value. So we use replace to double up the apostrophes. That gives us:
'"Hello ''world''"'
which is what we want. That took me a minute or two, believe you me.
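Applied to that example, the replace expression evaluates to the quoted string above, and the savetext step produces something like:
<c:param name="savetext" value="&quot;Hello 'world'&quot;"/>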
Next we wrap all our c:param elements in a c:param-set, construct a c:request to hold them, and use p:www-form-urlencode to encode them.
<p:wrap-sequence name="wrap" wrapper="c:param-set">
<p:input port="source">
<p:pipe step="for-each" port="result"/>
<p:pipe step="savetext" port="result"/>
</p:input>
</p:wrap-sequence>
<p:string-replace match="/c:request/@href" cx:depends-on="wrap">
<p:input port="source">
<p:inline>
<c:request method="POST" detailed="true" href="@@HERE@@">
<c:body content-type="application/x-www-form-urlencoded">@@HERE@@</c:body>
</c:request>
</p:inline>
</p:input>
<p:with-option name="replace" select="concat('&quot;', $pageuri, '&quot;')"/>
</p:string-replace>
<p:www-form-urlencode match="/c:request/c:body/text()">
<p:input port="parameters">
<p:pipe step="wrap" port="result"/>
</p:input>
</p:www-form-urlencode>
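Assembled and encoded, the final request is shaped like this; the hidden-field names and values are illustrative, and the savetext value is form-encoded along with everything else:
<c:request method="POST" detailed="true" href="http://wiki.example.com/SomePageName">
  <c:body content-type="application/x-www-form-urlencoded"
    >action=edit&amp;rev=0&amp;ticket=1234567890.12&amp;savetext=%3D+My+Page+%3D%0A...</c:body>
</c:request>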
Send that off to the server and we're done!
<p:http-request cx:cookies="login"/>
<p:delete match="/c:response/*"/>
</p:declare-step>
I display the result, after deleting its contents, just to make sure that I got a 200 back.
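Because the request was made with detailed="true", the result is a c:response wrapper, so after the p:delete a successful run leaves just something like:
<c:response xmlns:c="http://www.w3.org/ns/xproc-step" status="200"/>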
That little XProc script got all the pages loaded in just a couple of minutes. FTW!
If you're interested, the whole script is available.
Comments
That sounds really neat, Norm, and almost a validation of the work you've put into XProc! You certainly sound pleased with it!
Hi Norm,
The first thing that jumped into my head was to write a bunch of throw-away PHP/curl scripts, but this post sparked my interest in picking up XProc. Thanks for sharing it.
This may come up later in your blog, but I was curious whether you have an XSL, or how you managed to get DocBook XML files into a wiki format? We're trying to do that, but I want to see if someone has already done the heavy lifting.
Thanks,
Russ
No. It remains perennially on my todo list. The MoinMoin folks have done a Wiki-to-DocBook transformation, so that might be a good place to start.