<?xml version='1.0' encoding='utf-8'?>
<?xml-stylesheet href="/style/browser.xsl" type="text/xsl"?>
<essay xmlns="http://docbook.org/ns/docbook"
       xmlns:xlink="http://www.w3.org/1999/xlink"
       xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
       xmlns:dc='http://purl.org/dc/elements/1.1/'
       xmlns:dcterms="http://purl.org/dc/terms/"
       xmlns:gal='http://norman.walsh.name/rdf/gallery#'
       xmlns:foaf="http://xmlns.com/foaf/0.1/"
       xml:lang="en"
       version='5.0'>
<info>
<title>Wiki editing with XProc</title>
<volumenum>13</volumenum>
<issuenum>8</issuenum>
<pubdate>2010-03-07T16:25:44-05:00</pubdate>
<author><personname>
<firstname>Norman</firstname><surname>Walsh</surname>
</personname></author>
<copyright><year>2010</year><holder>Norman Walsh</holder></copyright>
<abstract>
<para>An example, for better or worse, of automating website interaction
with XProc.</para>
</abstract>
</info>

<para xml:id='p1'>What happened was, the DocBook wiki broke. I don't
know how or why, but it fell over. The problem, whatever it is, left
the wiki immutable and the underlying database in a state of
questionable consistency.</para>

<para xml:id='p2'>Clearly a problem that had to be fixed.
I set up a new wiki, running
<link xlink:href="http://moinmo.in/">MoinMoin</link> 1.9.2 instead of
<emphasis>1.3.4</emphasis> [Upgrade much? -ed].</para>

<para xml:id='p3'>In theory, there's an upgrade path from 1.3.4 to 1.9.2, but I'm
sufficiently unsure about the state of the current database that I'm
loath to use it. The last thing I want to do is put the
<emphasis>new</emphasis> wiki into some indeterminate state.
Instead, I grabbed all the most recent pages from the old wiki, trimmed
out a bunch of cruft, and cleaned up the markup a bit (the wiki markup
seems to have changed over time).</para>

<para xml:id='p4'>What I really wanted to do was add all these pages to the new wiki.
Easy enough to do with a browser for one or two pages, but several hundred
pages was way more than my patience would tolerate.</para>

<para xml:id='p5'>A quick experiment with
<link xlink:href="http://www.tuffcode.com/">HTTP Scoop</link> made it
look pretty easy:</para>

<itemizedlist>
<listitem>
<para xml:id='p6'>Logging in sets a cookie.</para>
</listitem>
<listitem>
<para xml:id='p7'>Loading a page that doesn't exist provides a link that you can follow
to create the page.</para>
</listitem>
<listitem>
<para xml:id='p8'>Following that link returns an HTML page containing a form with a
place to type the wiki markup and a bunch of hidden fields.</para>
</listitem>
<listitem>
<para xml:id='p9'>Posting that form back to the server updates the page.
</para>
</listitem>
</itemizedlist>

<para xml:id='p10'>If only I had a tool that could make HTTP requests and process the
results…wait, wait, I <emphasis>have</emphasis> one of those!</para>

<para xml:id='p11'>XProc ought to be up to this job, yes? Yes!
In fact, it was reasonably straightforward. Wanna see how it works?
Of course you do. The following pipeline works in
<link xlink:href="http://xmlcalabash.com/">XML Calabash</link> version 0.9.20 or
later.</para>

<para xml:id='p12'>I decided to pass the wiki markup as an input and the page name
as an option. From the option, I construct the value of the URI for
the page.</para>

<programlisting><![CDATA[<p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc" name="main"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:html="http://www.w3.org/1999/xhtml">
  <p:input port="source"/>
  <p:output port="result"/>
  <p:option name="page" required="true"/>

  <p:variable name="pageuri" select="concat('http://wiki.example.com/',$page)"/>]]></programlisting>
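
<para xml:id='p12a'>As a sketch of what that means in practice, the document
on the <literal>source</literal> port might look like the following. The page
text is invented for illustration; the <tag>c:data</tag> wrapper is the one
described later in this essay. Note that MoinMoin markup uses runs of
apostrophes for emphasis, which matters once the text has to be quoted inside
an XPath expression.</para>

<programlisting><![CDATA[<c:data xmlns:c="http://www.w3.org/ns/xproc-step">= Sand Box =

This is a ''hypothetical'' page used only as an example.

 * MoinMoin marks italics with two apostrophes
 * and '''bold''' with three
</c:data>]]></programlisting>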

<para xml:id='p13'>Next I have to log in:</para>

<programlisting><![CDATA[  <p:www-form-urlencode match="/c:request/c:body/text()">
    <p:input port="source">
      <p:inline>
        <c:request method="POST"
                   href="http://wiki.example.com/DocBookWikiWelcome">
          <c:body content-type="application/x-www-form-urlencoded">@@HERE@@</c:body>
        </c:request>
      </p:inline>
    </p:input>
    <p:input port="parameters">
      <p:inline>
        <c:param-set>
          <c:param name="action" value="login"/>
          <c:param name="name" value="NormanWalsh"/>
          <c:param name="password" value="MYPASSWORD"/>
          <c:param name="login" value="Login"/>
        </c:param-set>
      </p:inline>
    </p:input>
  </p:www-form-urlencode>

  <p:http-request cx:cookies="login" name="login"/>

  <p:sink/>]]></programlisting>

<para xml:id='p14'>I reverse-engineered the way the login form works. I URL-encode
my username, password, and other parameters and pass them to a <tag>p:http-request</tag>
that POSTs them to the server.</para>

<para xml:id='p15'>I don't care about the result, so I drop it on the floor with
<tag>p:sink</tag>.</para>

<para xml:id='p16'>I do care about cookies, so I have to store those somewhere. XML
Calabash has an extension that lets you manage cookies in named sets.
This <tag>p:http-request</tag> saves any cookies that come back in the
“<literal>login</literal>” set.</para>
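
<para xml:id='p16a'>For illustration, once <tag>p:www-form-urlencode</tag> has
replaced the <literal>@@HERE@@</literal> placeholder, the request handed to
the login <tag>p:http-request</tag> should look roughly like this. Treat it as
a sketch rather than captured output: the order of the parameters isn't
guaranteed, and a real serialization may carry extra namespace
declarations.</para>

<programlisting><![CDATA[<c:request xmlns:c="http://www.w3.org/ns/xproc-step"
           method="POST"
           href="http://wiki.example.com/DocBookWikiWelcome">
  <c:body content-type="application/x-www-form-urlencoded"
    >action=login&amp;name=NormanWalsh&amp;password=MYPASSWORD&amp;login=Login</c:body>
</c:request>]]></programlisting>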

<para xml:id='p17'>Next, we have to get the page we want to edit.</para>

<programlisting><![CDATA[  <p:string-replace match="/c:request/@href" cx:depends-on="login">
    <p:input port="source">
      <p:inline>
        <c:request method="GET" href="@@HERE@@"/>
      </p:inline>
    </p:input>
    <p:with-option name="replace" select="concat('&quot;', $pageuri, '&quot;')"/>
  </p:string-replace>

  <p:http-request cx:cookies="login" name="getpage"/>

  <p:sink/>]]></programlisting>

<para xml:id='p18'>I use the
“<literal>login</literal>” cookies so that the wiki knows who I am.
I also use the <tag class="attribute">cx:depends-on</tag> attribute to tell
the processor that this step depends on the preceding login step, even though
there's no dependency in the flow graph. Without this explicit statement about
dependency, the processor might attempt to get the page before performing
the login step.</para>

<para xml:id='p19'>Once again, I don't care about the output so I drop it on the floor.
In theory, I have to parse the output and find the “edit” link. In practice,
I know how to create it without looking for it in the markup. I'm not even
sure I have to do this step, but it is what a browser does and it was easy to do
so I left it in.</para>
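
<para xml:id='p19a'>To make the <tag>p:string-replace</tag> step concrete,
assume the pipeline is run with <option>page</option> set to the hypothetical
name <literal>SandBox</literal>. The <option>replace</option> option then holds
the XPath expression <literal>"http://wiki.example.com/SandBox"</literal>, a
string literal with its quotes included, and the request that reaches
<tag>p:http-request</tag> should be something like:</para>

<programlisting><![CDATA[<!-- assumes page=SandBox; the page name is made up for this example -->
<c:request xmlns:c="http://www.w3.org/ns/xproc-step"
           method="GET"
           href="http://wiki.example.com/SandBox"/>]]></programlisting>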

<para xml:id='p20'>Now we want to get the page that includes the edit form:</para>

<programlisting><![CDATA[  <p:string-replace match="/c:request/@href" cx:depends-on="getpage">
    <p:input port="source">
      <p:inline>
        <c:request method="GET" detailed="false" href="@@HERE@@"/>
      </p:inline>
    </p:input>
    <p:with-option name="replace" select="concat('&quot;', $pageuri, '?action=edit&quot;')"/>
  </p:string-replace>

  <p:http-request cx:cookies="login" name="getpageedit"/>]]></programlisting>

<para xml:id='p21'>Again, we use the login cookies. And this time we don't drop the output
on the floor because we have to extract the hidden fields from the page
in order for our subsequent POST to work.</para>

<programlisting><![CDATA[  <p:unescape-markup namespace="http://www.w3.org/1999/xhtml"
                     content-type="text/html" name="unescape"/>

  <p:for-each name="for-each">
    <p:iteration-source select="//html:input[@type='hidden']"/>
    <p:output port="result"/>

    <p:string-replace match="c:param/@name">
      <p:input port="source">
        <p:inline><c:param name="name" value="value"/></p:inline>
      </p:input>
      <p:with-option name="replace" select="concat('&quot;',/*/@name,'&quot;')"/>
    </p:string-replace>

    <p:string-replace match="c:param/@value">
      <p:with-option name="replace" select="concat('&quot;',/*/@value, '&quot;')">
        <p:pipe step="for-each" port="current"/>
      </p:with-option>
    </p:string-replace>
  </p:for-each>]]></programlisting>

<para xml:id='p22'>To get the hidden fields, we unescape the markup. XML Calabash uses
<link xlink:href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</link>
for “<literal>text/html</literal>” pages, so we'll get well-formed XML.</para>

<para xml:id='p23'>The <tag>p:for-each</tag> loop selects each of the
hidden input fields and transforms them into <tag>c:param</tag>
elements. We'll need those later.</para>
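
<para xml:id='p23a'>For example, suppose the edit form contains a hidden
field like the first element below. The field name and value are invented
here; the real ones depend on the wiki. The loop should then emit the
corresponding <tag>c:param</tag>. The two elements are shown together for
comparison and aren't meant to be a single document.</para>

<programlisting><![CDATA[<!-- one hidden field selected by p:iteration-source (hypothetical values) -->
<input xmlns="http://www.w3.org/1999/xhtml"
       type="hidden" name="rev" value="0"/>

<!-- what the two p:string-replace steps should produce for it -->
<c:param xmlns:c="http://www.w3.org/ns/xproc-step"
         name="rev" value="0"/>]]></programlisting>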

<para xml:id='p24'>Next, we have to construct the <tag>c:param</tag> for the
“<literal>savetext</literal>” parameter that contains our wiki markup.
This one's a bit tricky.</para>

<programlisting><![CDATA[  <p:string-replace name="savetext" match="/c:param/@value">
    <p:input port="source">
      <p:inline>
        <c:param name="savetext" value="@@HERE@@"/>
      </p:inline>
    </p:input>
    <p:with-option name="replace" select='concat("&apos;",replace(c:data,"&apos;","&apos;&apos;"),"&apos;")'>
      <p:pipe step="main" port="source"/>
    </p:with-option>
  </p:string-replace>]]></programlisting>

<para xml:id='p25'>What the hell, I hear you ask, is up with that “<literal>replace</literal>”
value?</para>

<para xml:id='p26'>Well, see, what's going to appear on the <literal>source</literal> input
port of our pipeline is a <tag>c:data</tag> element that contains the wiki
markup of the page. The <option>replace</option> option <emphasis>is evaluated</emphasis>
as an XPath expression, so we have to “quote” the value. This is a common
idiom in <tag>p:string-replace</tag><footnote><para xml:id='p27'>So common that I regret
not providing some sort of syntactic shortcut for it. Oh, well, there's always
version 1.1.</para></footnote>. Except, in this case, <emphasis>the
value</emphasis> may contain both double and single quotes, so we need
to make sure that they don't result in an invalid XPath
expression!</para>

<para xml:id='p28'>Imagine that this is our <tag>c:data</tag> element:</para>

<programlisting><![CDATA[<c:data>"Hello 'world'"</c:data>]]></programlisting>

<para xml:id='p29'>If we do the usual quoting trick, the resulting XPath expression will
be:</para>

<programlisting><![CDATA['"Hello 'world'"']]></programlisting>

<para xml:id='p30'>and that's not a syntactically valid XPath string literal. So we use
<function>replace</function> to double-up the apostrophes. That gives us</para>

<programlisting><![CDATA['"Hello ''world''"']]></programlisting>

<para xml:id='p31'>which is what we want. That took me a minute or two, believe you me.</para>
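
<para xml:id='p31a'>The doubling trick can be tried in isolation with a
minimal stand-alone sketch like the one below. It feeds the example
<tag>c:data</tag> element in inline instead of reading it from the
<literal>source</literal> port, and it assumes a processor with XPath 2.0
support, such as XML Calabash, because <function>replace</function> is an
XPath 2.0 function.</para>

<programlisting><![CDATA[<p:declare-step version="1.0"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step">
  <p:output port="result"/>

  <p:string-replace match="/c:param/@value">
    <p:input port="source">
      <p:inline>
        <c:param name="savetext" value="@@HERE@@"/>
      </p:inline>
    </p:input>
    <!-- Builds the expression '"Hello ''world''"', which the step then
         evaluates as XPath to recover the original string. -->
    <p:with-option name="replace"
                   select='concat("&apos;",replace(/c:data,"&apos;","&apos;&apos;"),"&apos;")'>
      <p:inline><c:data>"Hello 'world'"</c:data></p:inline>
    </p:with-option>
  </p:string-replace>
</p:declare-step>]]></programlisting>

<para xml:id='p31b'>Running it should produce a <tag>c:param</tag> whose
<tag class="attribute">value</tag> attribute is the original string, with
both kinds of quotes intact.</para>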

<para xml:id='p32'>Next we wrap all our <tag>c:param</tag> elements in a <tag>c:param-set</tag>,
construct a <tag>c:request</tag> to hold them, and use <tag>p:www-form-urlencode</tag>
to encode them.</para>

<programlisting><![CDATA[  <p:wrap-sequence name="wrap" wrapper="c:param-set">
    <p:input port="source">
      <p:pipe step="for-each" port="result"/>
      <p:pipe step="savetext" port="result"/>
    </p:input>
  </p:wrap-sequence>

  <p:string-replace match="/c:request/@href" cx:depends-on="wrap">
    <p:input port="source">
      <p:inline>
        <c:request method="POST" detailed="true" href="@@HERE@@">
          <c:body content-type="application/x-www-form-urlencoded">@@HERE@@</c:body>
        </c:request>
      </p:inline>
    </p:input>
    <p:with-option name="replace" select="concat('&quot;', $pageuri, '&quot;')"/>
  </p:string-replace>

  <p:www-form-urlencode match="/c:request/c:body/text()">
    <p:input port="parameters">
      <p:pipe step="wrap" port="result"/>
    </p:input>
  </p:www-form-urlencode>]]></programlisting>
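
<para xml:id='p32a'>As an illustration only, imagine a form with a single
hidden field named <literal>rev</literal> and a page whose text is
<literal>Hello world</literal>. Both are invented for this example. The
<literal>wrap</literal> step would then produce a parameter set along these
lines, and the encoded request body would be one string of name/value
pairs. The order of the pairs isn't guaranteed.</para>

<programlisting><![CDATA[<!-- output of the "wrap" step (field names and values are made up) -->
<c:param-set xmlns:c="http://www.w3.org/ns/xproc-step">
  <c:param name="rev" value="0"/>
  <c:param name="savetext" value="Hello world"/>
</c:param-set>

<!-- request body after p:www-form-urlencode, roughly -->
<c:body xmlns:c="http://www.w3.org/ns/xproc-step"
        content-type="application/x-www-form-urlencoded"
  >rev=0&amp;savetext=Hello+world</c:body>]]></programlisting>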

<para xml:id='p33'>Send that off to the server and we're done!</para>

<programlisting><![CDATA[  <p:http-request cx:cookies="login"/>

  <p:delete match="/c:response/*"/>

</p:declare-step>]]></programlisting>

<para xml:id='p34'>I display the result, after deleting its contents, just to make sure
that I got a 200 back.</para>
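
<para xml:id='p34a'>With the children deleted, a successful run should print
little more than the status. This is a sketch of the expected shape, not
captured output:</para>

<programlisting><![CDATA[<c:response xmlns:c="http://www.w3.org/ns/xproc-step" status="200"/>]]></programlisting>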

<para xml:id='p35'>That little XProc script got all the pages loaded in just a couple of
minutes. FTW!</para>

<para xml:id='p36'>If you're interested, the <link xlink:href="examples/wikiedit.xpl">whole
script</link> is available.</para>

</essay>
