Using Pygments from XSLT
Create highlighted source code listings in DocBook directly from XSLT.
This is just plain fun.
Syntax highlighting source code listings can make them easier to read (it certainly
makes them prettier).
Pygments is an excellent syntax highlighter. It's a Python tool that can transform source code listings into nice, clean HTML using spans and class attributes.
I've integrated a syntax highlighter into my XML workflow many times and in a variety of ways, most often using make or ant or an XProc pipeline and XInclude.
The topic comes up periodically on the DocBook Apps mailing list. When it came up most recently, Jirka suggested that it would be interesting to try integrating Pygments directly into Saxon with Jython. I thought that would be just the ticket. (It helped that I had just been digging around in the extension functions part of Saxon a few days earlier.)
There's no distribution for it yet, but it works and you can try it out with the latest commits to the DocBook XSLT 2.0 stylesheets. But first, you have to do a little setup.
- Get Jython and Pygments set up and working. You can tell it's working if you can run jython and then “import pygments” successfully.
- That's half the battle. The other half is getting things set up so that when Python is called from Java it can find Pygments. On my system, I had to put an explicit python.path setting in ~/.jython to make it work.
Assuming that you've got Jython and Pygments working, the rest is pretty easy. (If you don't, well, good luck! I don't really have anything helpful to add beyond what you'll find on the internets.)
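A small script can make the first check repeatable. What follows is only a sketch: the helper names can_import and report are made up for this example, and it covers the command-line half of the battle only (run it with jython, not your regular Python). When Jython is started from Java the search path may still differ, which is where the python.path setting comes in. If the import fails, the report lists the directories that were searched.

```python
import sys


def can_import(module_name):
    """Return True if module_name can be imported by this interpreter."""
    try:
        __import__(module_name)
        return True
    except ImportError:
        return False


def report(module_name):
    """Build a short, human-readable report about one module."""
    if can_import(module_name):
        return "%s: found" % module_name
    lines = ["%s: NOT found; directories searched:" % module_name]
    lines.extend("  " + str(entry) for entry in sys.path)
    return "\n".join(lines)


if __name__ == "__main__":
    # Checks only the interpreter that runs this script.
    print(report("pygments"))
```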
- Make sure that the DocBook extension jar file (docbook-xsl2-saxon.jar) and the jython.jar file are in your class path.
- If you're running Saxon HE from the command line, make sure that you include -init:docbook.Initializer among the arguments. That'll run some code that registers the DocBook extension functions.
  If you're running Saxon PE or EE, you can put the extension functions in your config file:
…
<resources>
  <extensionFunction>org.docbook.extensions.xslt20.Cwd</extensionFunction>
  <extensionFunction>org.docbook.extensions.xslt20.ImageIntrinsics</extensionFunction>
  <extensionFunction>org.docbook.extensions.xslt20.Pygmenter</extensionFunction>
</resources>
- Format a document that contains a programlisting and it should come out magically highlighted.
- If you don't see any highlighting, that's probably because you don't have the right CSS. You can get the CSS with this Python script:
from pygments.formatters import HtmlFormatter
print HtmlFormatter().get_style_defs('.highlight')
If something goes wrong in Pygments, your whole stylesheet will crash. At some point, I'll go back and work a little harder to be defensive about errors.
A few notes:
- There are techniques for doing this in the browser, but I wanted to integrate it into the stylesheets so that additional decoration, like line numbering, would be applied to the highlighted listing.
- Syntax highlighting is only applied to programlisting, screen, and synopsis environments. Perhaps it should be limited exclusively to programlisting.
- If you don't specify a language attribute on the listing, Pygments will try to guess. Sometimes it guesses wrong. You'll be better off if you explicitly specify one of the languages that Pygments understands.
- Syntax highlighting is not performed if any of the following conditions hold:
  - The role attribute contains the string nopygments.
  - The listing contains any nested elements. If you've put phrase or replaceable or some such in your program listing, you're on your own. (There's no way Pygments would do the right thing anyway.)
  - The listing is longer than 9,000 characters. I'm using recursion to handle tag boundaries and if you have too long a listing, the stack gets blown. I'll have to see if I can come up with a non-recursive approach.
- The extension assumes that what Pygments returns is some wrappers around a single HTML pre that contains the entire listing.
- Adding Jython to the mix adds several seconds to the startup time when you run a transformation. If you leave jython.jar off your class path when you don't need it, the DocBook initialization code will silently omit the highlighting extension.
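The three skip conditions in the notes above amount to a simple predicate. Here is a rough Python rendering of it, purely as an illustration: the real decision is made in the stylesheet, and the names should_highlight, has_child_elements, and MAX_LISTING_LENGTH are invented for this sketch.

```python
MAX_LISTING_LENGTH = 9000  # the limit quoted in the notes


def should_highlight(text, role="", has_child_elements=False):
    """Return True if a listing passes all three skip conditions."""
    if "nopygments" in role:
        # The role attribute contains the string nopygments.
        return False
    if has_child_elements:
        # Nested markup such as phrase or replaceable is left alone.
        return False
    if len(text) > MAX_LISTING_LENGTH:
        # Listings that are too long would blow the stack.
        return False
    return True
```

Note that the role test is a substring match, so a role such as "example nopygments" also switches highlighting off.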
Ironically, these pages are formatted with MarkLogic Server, which isn't Saxon, so nothing about Jython has any relevance at all.
What I did for MarkLogic was implement the extension as a web service that I can call from a stylesheet running on MarkLogic Server. The fact that XSLT makes this “just work” is really nice. Here's the place where the stylesheet decides what to do:
<xsl:choose>
  <xsl:when test="contains(@role,'nopygments') or string-length(.) > 9000
                  or self::db:literallayout or exists(*)">
    <xsl:apply-templates/>
  </xsl:when>
  <xsl:when use-when="function-available('xdmp:http-post')"
            test="$pygmenter-uri != ''">
    <xsl:sequence select="ext:highlight(string(.), string(@language))"/>
  </xsl:when>
  <xsl:when use-when="function-available('ext:highlight')"
            test="true()">
    <xsl:sequence select="ext:highlight(string(.), string(@language))"/>
  </xsl:when>
  <xsl:otherwise>
    <xsl:apply-templates/>
  </xsl:otherwise>
</xsl:choose>
- The first branch checks to see if it should skip highlighting. (The literallayout test is only needed because this is in a template that happens to handle monospaced literal layouts, which I'm just assuming aren't program listings.)
- The presence of the second branch is conditional on the availability of the xdmp:http-post function (which is only likely inside MarkLogic Server). If you're not running on our server, that xsl:when just gets deleted at “compile time” so nothing it contains can possibly matter.
  If you are running on MarkLogic Server, then we'll test that you've defined the $pygmenter-uri endpoint. Can't run the highlighter if you haven't told us where it is.
- The presence of the third branch is conditional on the availability of ext:highlight at “compile time” so if you're not running on Saxon or you're not running with my extensions, that branch just gets deleted as well.
  If the function exists, we can run it so the test here is just “true()”.
- Finally, if you've run out of options, just format the listing.
Later on, there's a declaration for the ext:highlight function. Note that its presence is also conditional. It'll only be seen by MarkLogic Server. On Saxon, the function is provided through the processor's extension mechanisms; it can't be implemented in XSLT.
<xsl:function use-when="function-available('xdmp:http-post')"
              name="ext:highlight" as="node()*">
  <xsl:param name="code"/>
  <xsl:param name="language"/>
  <xsl:variable name="code-node" as="text()">
    <xsl:value-of select="$code"/>
  </xsl:variable>
  <xsl:variable name="highlighted"
                select="xdmp:http-post(concat($pygmenter-uri,'?language=',$language),(),$code-node)"/>
  <xsl:sequence select="$highlighted[2]//h:pre/node()"/>
</xsl:function>
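The last step of that function keeps only what is inside the pre element of the response. For readers who would rather see that step outside XSLT, here is a rough Python equivalent using the standard library. It is a sketch under assumptions: the SAMPLE string is hand-written to resemble a highlighted response, not captured from a real run, and the function name extract_pre is invented for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def extract_pre(response_body):
    """Return the first XHTML pre element below the root, or None."""
    root = ET.fromstring(response_body)
    return root.find(".//{%s}pre" % XHTML_NS)


# Hand-written stand-in for a response body: wrappers around a single pre.
SAMPLE = (
    "<div xmlns='http://www.w3.org/1999/xhtml'>"
    "<div class='highlight'><pre>"
    "<span class='k'>import</span> <span class='nn'>sys</span>\n"
    "</pre></div></div>"
)

pre = extract_pre(SAMPLE)
# Each child span carries a class attribute that the CSS styles.
spans = [(span.get("class"), span.text) for span in pre]
```

If the response ever contained more than one pre, this sketch would silently use the first one, which is the same single-pre assumption mentioned in the notes.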
Anyway, that's how the formatting is being done on these pages. Here's the endpoint, in case you're interested (it serves as another nice example of syntax highlighting even if you aren't interested):
#!/usr/local/bin/python
# CGI endpoint (Python 2): highlight the POSTed source code with Pygments.
import sys, os, re, string
from pygments import highlight
from pygments.lexers import (get_lexer_by_name, get_lexer_for_mimetype)
from pygments.lexers import guess_lexer
from pygments.formatters import (HtmlFormatter, get_formatter_by_name)
from pygments.util import ClassNotFound
print "Content-Type: application/xml"
print ""
query = os.environ["QUERY_STRING"]
# The listing to highlight arrives as the request body.
code = ""
for line in sys.stdin:
    code = code + line.decode("utf8")
# The query string is read raw; no URL decoding is done.
language = re.search("language=([^?&]+)", query)
if language is None:
    language = ""
else:
    language = language.group(1)
formatname = re.search("formatter=([^?&]+)", query)
if formatname is None:
    formatname = "html"
else:
    formatname = formatname.group(1)
if language == "":
    lexer = guess_lexer(code)
else:
    try:
        lexer = get_lexer_by_name(language)
    except ClassNotFound:
        # An empty lexer makes highlight() fail below, which selects the plain fallback.
        lexer = ""
formatter = get_formatter_by_name(formatname)
try:
    result = highlight(code, lexer, formatter)
except AttributeError:
    sys.stderr.write("Failed to highlight code: lexer=" + str(lexer) + " formatter=" + formatname + "\n")
    sys.stderr.write(code.encode("ascii", "xmlcharrefreplace") + "\n")
    # Fall back to an escaped, unhighlighted listing.
    code = code.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    result = "<pre>" + code + "</pre>"
# Put it in the right namespace
result = "<div xmlns='http://www.w3.org/1999/xhtml'>" + result + "</div>"
print result.encode("ascii", "xmlcharrefreplace")
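To poke at an endpoint like this without going through a stylesheet, something along these lines should do. It is a sketch in Python 3 (unlike the Python 2 listing above), the endpoint URL is a made-up placeholder, and the function name build_highlight_request is invented for the example. The URL is built by plain concatenation, the same way the stylesheet does it, because the endpoint reads the raw query string and does no URL decoding.

```python
from urllib.request import Request


def build_highlight_request(endpoint, code, language=""):
    """Build (but do not send) a POST request for the highlighting endpoint."""
    # Mirror the stylesheet: the language is appended as-is, unescaped.
    url = endpoint + "?language=" + language
    return Request(url, data=code.encode("utf-8"), method="POST")


request = build_highlight_request(
    "http://localhost:8080/cgi-bin/pygmenter",  # hypothetical endpoint URL
    "print('hello')\n",
    language="python",
)
# To send it: urllib.request.urlopen(request).read()
```

Building the request separately from sending it makes it easy to inspect the URL and the body before anything touches the network.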
I'm quite pleased that I was able to make it work in both Saxon and MarkLogic Server! (The web service solution won't work in Saxon because there's no standard way to do an HTTP POST with XSLT, but if you have an extension that will POST, you could craft an extension function that would work in Saxon using the web service. That would let you avoid the overhead of Jython.)
Comments
Perhaps you might wish to suggest a new topic for the next edition of "DocBook XSL: The Complete Guide - 4th Edition" (Bob Stayton) to replace (or add to) this topic http://www.sagehill.net/docbookxsl/SyntaxHighlighting.html
I believe that book is widely read.