Using Pygments from XSLT

Volume 14, Issue 33; 31 Aug 2011

Create highlighted source code listings in DocBook directly from XSLT.

This is just plain fun.

Syntax highlighting source code listings can make them easier to read (it certainly makes them prettier). Pygments is an excellent syntax highlighter. It's a Python tool that can transform source code listings into nice, clean HTML using spans and class attributes.

I've integrated a syntax highlighter into my XML workflow many times and in a variety of ways, most often using make or ant or an XProc pipeline and XInclude.

The topic comes up periodically on the DocBook Apps mailing list. When it came up most recently, Jirka suggested that it would be interesting to try integrating Pygments directly into Saxon with Jython. I thought that would be just the ticket. (It helped that I had just been digging around in the extension functions part of Saxon a few days earlier.)

There's no distribution for it yet, but it works and you can try it out with the latest commits to the DocBook XSLT 2.0 stylesheets. But first, you have to do a little setup.

  1. Get Jython and Pygments set up and working. You can tell it's working if you can run jython and then “import pygments” successfully.

  2. That's half the battle. The other half is getting things set up so that when Python is called from Java it can find Pygments. On my system, I had to put an explicit python.path setting in ~/.jython to make it work.

    Assuming that you've got Jython and Pygments working, the rest is pretty easy. (If you don't, well, good luck! I don't really have anything helpful to add beyond what you'll find on the internets.)

  3. Make sure that the DocBook extension jar file (docbook-xsl2-saxon.jar) and the jython.jar file are in your class path.

  4. If you're running Saxon HE from the command line, make sure that you include -init:docbook.Initializer among the arguments. That'll run some code that registers the DocBook extension functions.

    If you're running Saxon PE or EE, you can put the extension functions in your config file:

      …
      <resources>
        <extensionFunction>org.docbook.extensions.xslt20.Cwd</extensionFunction>
        <extensionFunction>org.docbook.extensions.xslt20.ImageIntrinsics</extensionFunction>
        <extensionFunction>org.docbook.extensions.xslt20.Pygmenter</extensionFunction>
      </resources>

  5. Format a document that contains a programlisting and it should come out magically highlighted.

  6. If you don't see any highlighting, that's probably because you don't have the right CSS. You can get the CSS with this python script:

    from pygments.formatters import HtmlFormatter
    
    print(HtmlFormatter().get_style_defs('.highlight'))
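Steps 1 and 2 both come down to one question: can the interpreter that ends up running the code import pygments? Here's a minimal sketch of a check for that. The function names are made up for illustration, nothing in it is part of the DocBook stylesheets, and it should run under either Jython 2.7 or a current Python.

```python
import importlib
import sys


def can_import(name):
    """Return True if the module called `name` can be imported."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


def import_report(name="pygments"):
    """Say whether `name` is importable; list the search path if it is not."""
    if can_import(name):
        return "%s: importable" % name
    lines = ["%s: NOT importable; the search path is:" % name]
    lines.extend("  %s" % (entry,) for entry in sys.path)
    return "\n".join(lines)


if __name__ == "__main__":
    print(import_report())
```

If the report lists a search path that doesn't include the directory where Pygments lives, that's the python.path problem from step 2.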

If something goes wrong in Pygments, your whole stylesheet will crash. At some point, I'll go back and work a little harder to be defensive about errors.

A few notes:

  • There are techniques for doing this in the browser, but I wanted to integrate it into the stylesheets so that additional decoration, like line numbering, would be applied to the highlighted listing.

  • Syntax highlighting is only applied to programlisting, screen, and synopsis environments. Perhaps it should be limited exclusively to programlisting.

  • If you don't specify a language attribute on the listing, Pygments will try to guess. Sometimes it guesses wrong. You'll be better off if you explicitly specify one of the languages that Pygments understands.

  • Syntax highlighting is not performed if any of the following conditions hold:

    • The role attribute contains the string nopygments.

    • The listing contains any nested elements. If you've put phrase or replaceable or some such in your program listing, you're on your own. (There's no way Pygments would do the right thing anyway.)

    • The listing is longer than 9,000 characters. I'm using recursion to handle tag boundaries and if you have too long a listing, the stack gets blown. I'll have to see if I can come up with a non-recursive approach.

  • The extension assumes that what Pygments returns is some wrappers around a single HTML pre that contains the entire listing.

  • Adding Jython to the mix adds several seconds to the startup time when you run a transformation. If you leave jython.jar off your class path when you don't need it, the DocBook initialization code will silently omit the highlighting extension.
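The three skip conditions in the list above can be restated as a small predicate. This is a sketch in Python, not the XSLT that the stylesheets actually use. The function name and the use of ElementTree are assumptions for illustration; only the nopygments role, the nested-element rule, and the 9,000 character limit come from the stylesheets.

```python
import xml.etree.ElementTree as ET

MAX_LENGTH = 9000  # listings longer than this are left alone


def should_highlight(listing):
    """Return True if none of the skip conditions apply to `listing`.

    `listing` is an ElementTree element standing in for a programlisting.
    """
    if "nopygments" in listing.get("role", ""):
        return False  # the author opted out
    if len(listing) > 0:
        return False  # nested elements such as phrase or replaceable
    if len("".join(listing.itertext())) > MAX_LENGTH:
        return False  # too long for the recursive post-processing
    return True


plain = ET.fromstring('<programlisting language="python">x = 1</programlisting>')
nested = ET.fromstring('<programlisting>ls <replaceable>dir</replaceable></programlisting>')
print(should_highlight(plain), should_highlight(nested))
```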

Ironically, these pages are formatted with MarkLogic Server, which isn't Saxon, and so nothing about Jython has any relevance at all.

What I did for MarkLogic was implement the extension as a web service that I can call from a stylesheet running on MarkLogic Server. The fact that XSLT makes this “just work” is really nice. Here's the place where the stylesheet decides what to do:

<xsl:choose>
  <xsl:when test="contains(@role,'nopygments') or string-length(.) &gt; 9000
                  or self::db:literallayout or exists(*)">
    <xsl:apply-templates/>
  </xsl:when>

  <xsl:when use-when="function-available('xdmp:http-post')"
            test="$pygmenter-uri != ''">
    <xsl:sequence select="ext:highlight(string(.), string(@language))"/>
  </xsl:when>

  <xsl:when use-when="function-available('ext:highlight')"
            test="true()">
    <xsl:sequence select="ext:highlight(string(.), string(@language))"/>
  </xsl:when>

  <xsl:otherwise>
    <xsl:apply-templates/>
  </xsl:otherwise>
</xsl:choose>

  • The first branch checks to see if it should skip highlighting. (The literallayout test is only needed because this is in a template that happens to handle monospaced literal layouts, which I'm just assuming aren't program listings.)

  • The presence of the second branch is conditional on the availability of the xdmp:http-post function (which is only likely inside MarkLogic Server). If you're not running on our server, that xsl:when just gets deleted at “compile time” so nothing it contains can possibly matter.

    If you are running on MarkLogic Server, then we'll test that you've defined the $pygmenter-uri endpoint. Can't run the highlighter if you haven't told us where it is.

  • The presence of the third branch is conditional on the availability of ext:highlight at “compile time” so if you're not running on Saxon or you're not running with my extensions, that branch just gets deleted as well.

    If the function exists, we can run it so the test here is just “true()”.

  • Finally, if you've run out of options, just format the listing.

Later on, there's a declaration for the ext:highlight function. Note that its presence is also conditional. It'll only be seen by MarkLogic Server. On Saxon, the function is provided through the processor's extension mechanisms; it can't be implemented in XSLT.

<xsl:function use-when="function-available('xdmp:http-post')"
              name="ext:highlight" as="node()*">
  <xsl:param name="code"/>
  <xsl:param name="language"/>

  <xsl:variable name="code-node" as="text()">
    <xsl:value-of select="$code"/>
  </xsl:variable>

  <xsl:variable name="highlighted"
                select="xdmp:http-post(concat($pygmenter-uri,'?language=',$language),(),$code-node)"/>

  <xsl:sequence select="$highlighted[2]//h:pre/node()"/>
</xsl:function>
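To poke at the web service without going through a stylesheet, the same request can be put together in Python 3. This is only a sketch: the helper names are made up, the URL in the usage lines is a placeholder, and the response is assumed to be an XHTML div wrapping a single pre, as the stylesheet expects.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_request(endpoint, code, language=""):
    """Build (but do not send) a POST carrying `code` as the request body."""
    url = endpoint + "?" + urllib.parse.urlencode({"language": language})
    return urllib.request.Request(
        url,
        data=code.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def extract_pre(response_xml):
    """Return the first XHTML pre element in the response, or None."""
    root = ET.fromstring(response_xml)
    return root.find(".//{%s}pre" % XHTML_NS)


def highlight_remote(endpoint, code, language=""):
    """POST `code` to the service and return the pre element it sends back."""
    request = build_request(endpoint, code, language)
    with urllib.request.urlopen(request) as reply:
        return extract_pre(reply.read())


# Placeholder endpoint; nothing is sent until highlight_remote() is called.
request = build_request("http://localhost:8000/pygmenter", "x = 1\n", "python")
print(request.get_method(), request.full_url)

# A hand-written response in the assumed shape.
sample = ("<div xmlns='http://www.w3.org/1999/xhtml'><div class='highlight'>"
          "<pre><span class='n'>x</span> = 1</pre></div></div>")
print("".join(extract_pre(sample).itertext()))
```

Keeping build_request separate from the network call makes it easy to inspect the URL and body before anything is sent.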

Anyway, that's how the formatting is being done on these pages. Here's the endpoint, in case you're interested (it serves as another nice example of syntax highlighting even if you aren't interested):

#!/usr/local/bin/python

import sys, os, re

from pygments import highlight
from pygments.lexers import (get_lexer_by_name, get_lexer_for_mimetype)
from pygments.lexers import guess_lexer
from pygments.formatters import (HtmlFormatter, get_formatter_by_name)
from pygments.util import ClassNotFound

print "Content-Type: application/xml"
print ""

query = os.environ["QUERY_STRING"]
code = ""

for line in sys.stdin:
    code = code + line.decode("utf8")

language = re.search("language=([^?&]+)", query)
if language is None:
    language = ""
else:
    language = language.group(1)

formatname = re.search("formatter=([^?&]+)", query)
if formatname is None:
    formatname = "html"
else:
    formatname = formatname.group(1)

if language == "":
    lexer = guess_lexer(code)
else:
    try:
        lexer = get_lexer_by_name(language)
    except ClassNotFound:
        lexer = ""

formatter = get_formatter_by_name(formatname)

try:
    result = highlight(code, lexer, formatter)
except AttributeError:
    sys.stderr.write("Failed to highlight code: lexer=" + str(lexer) + " formatter=" + formatname + "\n")
    sys.stderr.write(code.encode("ascii","xmlcharrefreplace") + "\n")
    code = code.replace("&", "&amp;").replace("<", "&lt;").replace(">","&gt;")
    result = "<pre>" + code + "</pre>"

# Put it in the right namespace
result = "<div xmlns='http://www.w3.org/1999/xhtml'>" + result + "</div>"

print result.encode("ascii","xmlcharrefreplace")
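The two re.search calls pull the parameters straight out of the raw query string, so a value such as c%2B%2B arrives still percent-encoded. If that matters, the standard library can do the parsing instead. Here's a minimal sketch, written for Python 3 (the script above is Python 2) and using a hypothetical helper name:

```python
from urllib.parse import parse_qs


def parse_params(query_string):
    """Return (language, formatter) from a CGI query string.

    A missing language becomes "" (so the caller can guess) and a missing
    formatter becomes "html", matching the defaults in the script above.
    """
    params = parse_qs(query_string)
    language = params.get("language", [""])[0]
    formatter = params.get("formatter", ["html"])[0]
    return language, formatter


print(parse_params("language=python"))
print(parse_params("language=c%2B%2B&formatter=html"))
print(parse_params(""))
```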

I'm quite pleased that I was able to make it work in both Saxon and MarkLogic Server! (The web service solution won't work in Saxon because there's no standard way to do an HTTP POST with XSLT, but if you have an extension that will POST, you could craft an extension function that would work in Saxon using the web service. That would let you avoid the overhead of Jython.)

Comments

Perhaps you might wish to suggest a new topic for the next edition of "DocBook XSL: The Complete Guide - 4th Edition" (Bob Stayton) to replace (or add to) this topic http://www.sagehill.net/docbookxsl/SyntaxHighlighting.html

I believe that book is widely read.

—Posted by Derek Read on 01 Sep 2011 @ 10:48 UTC