
Multiline Code Blocks for Markdown With Syntax Highlighting

Since I'm releasing lots of code on this site, being able to have code blocks in articles sounded like a good idea.

At first, I wrapped the code in standard XHTML tags and added some formatting via CSS.

Then, I started to use the Markdown syntax (through its Python port) in my posts, and their source became much easier to read and write.

Unfortunately, I ran into some serious problems with code blocks in Markdown. It has been a while and I cannot remember exactly what the issue was, sorry. As far as I recall, the problems were related to indentation, line breaks, and (unintentional) Markdown syntax in a code block.

Syntax highlighting was not available either, and I really wanted it for better readability. I had already been using the great Pygments package for my snippets section, and it had proved to be an excellent choice.

So, I wanted to have a Markdown syntax element like this:

[sourcecode:lexer]
some code
[/sourcecode]

lexer can be any language short name supported by Pygments.

Here is a Markdown preprocessor that uses Pygments to highlight the content enclosed by the above syntax:

import re

from markdown import Preprocessor
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name, TextLexer


class CodeBlockPreprocessor(Preprocessor):

    pattern = re.compile(
        r'\[sourcecode:(.+?)\](.+?)\[/sourcecode\]', re.S)

    def run(self, lines):
        def repl(m):
            try:
                lexer = get_lexer_by_name(m.group(1))
            except ValueError:
                lexer = TextLexer()
            code = highlight(m.group(2), lexer, HtmlFormatter())
            code = code.replace('\n\n', '\n \n')
            return '\n\n<div class="code">%s</div>\n\n' % code
        return self.pattern.sub(
            repl, '\n'.join(lines)).split('\n')
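
To see what the pattern captures without installing Markdown or Pygments, here is a minimal sketch that uses only the standard library. The helper names find_blocks and protect_blank_lines are made up for this illustration. The comment on the blank-line replacement is an assumption about its purpose (presumably it keeps Markdown from splitting the highlighted HTML at empty lines), not something the original code states.

```python
import re

# Same pattern as in the preprocessor above. re.S lets "." match newlines,
# so a block may span several lines.
PATTERN = re.compile(r'\[sourcecode:(.+?)\](.+?)\[/sourcecode\]', re.S)

SAMPLE = """Some text.

[sourcecode:python]
print('hello')

print('world')
[/sourcecode]

More text."""


def find_blocks(text):
    """Return a (lexer name, code) pair for every block found in the text."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]


def protect_blank_lines(code):
    """Put a space on empty lines, as the preprocessor does with its output.

    Assumption: this keeps later Markdown processing from treating the
    empty line as a paragraph boundary.
    """
    return code.replace('\n\n', '\n \n')


blocks = find_blocks(SAMPLE)
for lexer_name, code in blocks:
    print(lexer_name, repr(protect_blank_lines(code)))
```

Both groups are non-greedy, so several blocks in one document are matched separately instead of being merged into one.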

Then, the preprocessor can be integrated like this:

from markdown import Markdown

md = Markdown()
md.preprocessors.insert(0, CodeBlockPreprocessor())
markdown = md.__str__

markdown is then a callable that can be passed to the context of a template and used in that template, for example.

Finally, have Pygments generate a stylesheet (pygments.css in this example) to be embedded into the website:

$ pygmentize -S <some style> -f html > pygments.css

Here you are, enjoy the new colorful code presentation!

Update: As of Pygments 0.9, released October 14, 2007, the code presented here is included in the distribution as external/markdown-processor.py.

Update: If you write documents in reStructuredText that contain code blocks, you might appreciate that as of docutils 0.9, released May 02, 2012, both a directive and an interpreted role for code, highlighted by Pygments, are supported out of the box.

Git Submodules

I had some trouble dealing with submodules. This is what I experienced, learned, and did to solve the issue.

Using a custom SSH URL

Let's see if our repository has submodules:

$ git submodule
 9d83ae91cffb5840a5ff722e9f1dabc2d74c96ae vagrant/cookbooks (9d83ae9)

Yes, it does. Take a look at the submodule details which are stored in .gitmodules; it should look like this:

[submodule "vagrant/cookbooks"]
    path = vagrant/cookbooks
    url = git@github.com:AcmeCorporation/FlagshipProduct-cookbooks.git
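
If you want to inspect these details from a script, the following is a rough sketch that reads the content of .gitmodules with Python's configparser. It assumes a simple file like the one above; Git's configuration syntax is not exactly INI, so this will not cope with every valid file. The function name submodule_urls is made up for this example.

```python
from configparser import ConfigParser

# The file content from above, inlined so the example is self-contained.
GITMODULES = '''\
[submodule "vagrant/cookbooks"]
    path = vagrant/cookbooks
    url = git@github.com:AcmeCorporation/FlagshipProduct-cookbooks.git
'''


def submodule_urls(text):
    """Map each submodule name to its URL (simple files only)."""
    # Disable interpolation so that "%" in a URL is taken literally.
    parser = ConfigParser(interpolation=None)
    parser.read_string(text)
    urls = {}
    for section in parser.sections():
        # Section names look like: submodule "vagrant/cookbooks"
        if section.startswith('submodule '):
            name = section[len('submodule '):].strip('"')
            urls[name] = parser.get(section, 'url')
    return urls


print(submodule_urls(GITMODULES))
```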

Unfortunately, that's not the URL we want to use. Instead, we have a dedicated host alias in ~/.ssh/config:

Host github
    HostName github.com
    User git
    IdentityFile ~/.ssh/id_rsa_github

But first, initialize the submodule (this has to be done from the repository's root path):

$ git submodule init
Submodule 'vagrant/cookbooks' (git@github.com:AcmeCorporation/FlagshipProduct-cookbooks.git) registered for path 'vagrant/cookbooks'

Make sure that there is a URL shown between the parentheses; if there isn't, something is already wrong.

That step should have added a section to .git/config:

[…]
[submodule "vagrant/cookbooks"]
    url = git@github.com:AcmeCorporation/FlagshipProduct-cookbooks.git

If we tried to update the submodule with the original SSH URL, it might fail:

$ git submodule update
Permission denied (publickey).
fatal: The remote end hung up unexpectedly
Unable to fetch in submodule path 'vagrant/cookbooks'

Just adjust the SSH URL in .git/config as necessary (in our case, use the SSH host alias defined above):

[…]
[submodule "vagrant/cookbooks"]
    url = github:AcmeCorporation/FlagshipProduct-cookbooks.git
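
The rewrite itself is a plain prefix substitution. As an illustration only, here it is as a small Python function; the name use_ssh_alias and its default arguments are assumptions that match the example alias and URL used in this article.

```python
def use_ssh_alias(url, prefix='git@github.com:', alias='github:'):
    """Replace the user-and-host prefix of an SCP-style URL with an alias.

    URLs that do not start with the prefix are returned unchanged.
    """
    if url.startswith(prefix):
        return alias + url[len(prefix):]
    return url


original = 'git@github.com:AcmeCorporation/FlagshipProduct-cookbooks.git'
rewritten = use_ssh_alias(original)
print(rewritten)
```

The user name can be dropped from the URL because the alias in ~/.ssh/config already sets User git.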

Now we can finally update the submodule:

$ git submodule update
Cloning into 'vagrant/cookbooks'...
remote: Counting objects: 302, done.
remote: Compressing objects: 100% (216/216), done.
remote: Total 302 (delta 86), reused 273 (delta 59)
Receiving objects: 100% (302/302), 149.71 KiB | 198 KiB/s, done.
Resolving deltas: 100% (86/86), done.
Submodule path 'vagrant/cookbooks': checked out '9d83ae91cffb5840a5ff722e9f1dabc2d74c96ae'

Yay! :)

Troubleshooting

This happened to me after I initialized the submodule:

$ git submodule update
fatal: Needed a single revision
Unable to find current revision in submodule path 'vagrant/cookbooks'

An FAQ by Gostai states:

This is the sign that the initial checkout […] went completely wrong (I don’t know what makes this happen). Chances are that the directory exists, but is empty. Git does not seem to be able to overcome this situation, […]

So just delete the submodule's directory:

$ rm -rf vagrant/cookbooks

As pointed out in a comment on an answer on Stack Overflow, this might not be sufficient, so delete Git's internal directory for the submodule, too:

$ rm -rf .git/modules/vagrant/cookbooks
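
The two deletions can also be scripted. The following is a sketch under a few assumptions: the function name reset_submodule is made up, .git is a regular directory in the repository root, and the submodule's name equals its path (as it does for vagrant/cookbooks here). Like rm -rf, it deletes without asking, so double-check the paths first.

```python
import shutil
from pathlib import Path


def reset_submodule(repo_root, submodule_path):
    """Remove a submodule's working directory and Git's internal copy.

    Return the list of directories that were actually removed.
    """
    root = Path(repo_root)
    candidates = [
        # The (possibly empty) checkout, e.g. vagrant/cookbooks
        root / submodule_path,
        # Git's internal directory, e.g. .git/modules/vagrant/cookbooks
        root / '.git' / 'modules' / submodule_path,
    ]
    removed = []
    for directory in candidates:
        if directory.is_dir():
            shutil.rmtree(directory)
            removed.append(directory)
    return removed
```

Calling reset_submodule('.', 'vagrant/cookbooks') from the repository's root path would correspond to the two rm commands above.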

Now start again by initializing or updating the submodule.

Extract title and metadata from a reStructuredText document

# -*- coding: utf-8 -*-

"""
Extract title and metadata from a reStructuredText document
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This functionality was factored out of the docutils_ integration of the
`homework productions`_ web application.

Its purpose is to transform reStructuredText_ documents to HTML, but extract
the title and metadata before rendering the body and provide them separately.

The resulting HTML content would then be assembled from those fragments, but
with more flexibility (e.g. the document's date and tags can be rendered
according to a template instead of how docutils_ would generate markup).

The metadata is parsed from a reStructuredText_ field list. The fields that
should be extracted have to be specified along with a function that parses
the string value. Unspecified fields are discarded.

.. _docutils: http://docutils.sourceforge.net/
.. _homework productions: http://homework.nwsnet.de/
.. _reStructuredText: http://docutils.sourceforge.net/rst.html

:Copyright: 2007-2012 Jochen Kupperschmidt
:Date: 13-Jun-2012
:License: MIT
"""

from collections import namedtuple
from contextlib import contextmanager
from datetime import date, datetime

from docutils import core, io, nodes, readers


DocumentParts = namedtuple('DocumentParts', ['metadata', 'title', 'body'])

def parse_document(input_string, field_names_and_parsers):
    """
    Parse the input string as a reStructuredText document and return these
    values, wrapped in a named tuple:

    - ``metadata``: A dictionary with metadata extracted from the first field
      list in the document. A field is only considered if it is explicitly
      specified, and its value will be transformed using the function assigned
      for it.

    - ``title``: The document's first-level heading.

    - ``body``: The document body, rendered as HTML. This does not include the
      first field list and the first-level heading, since both are removed
      from the document tree when they are extracted.
    """
    overrides = {
        # Disable the transformation of the first field list into
        # bibliographic "docinfo" elements so that it stays a plain field list.
        'docinfo_xform': 0,
        'initial_header_level': 2,
    }

    # Read tree and extract metadata.
    doctree = core.publish_doctree(input_string, settings_overrides=overrides)

    title = extract_title(doctree)
    metadata = extract_metadata(doctree, field_names_and_parsers)

    # Parse content.
    reader = readers.doctree.Reader(parser_name='null')
    pub = core.Publisher(reader, source=io.DocTreeInput(doctree),
                         destination_class=io.StringOutput)
    pub.set_writer('html')
    # Make ``initial_header_level`` work.
    pub.process_programmatic_settings(None, overrides, None)
    pub.publish()

    return DocumentParts(
        metadata=metadata,
        title=title,
        body=pub.writer.parts['html_body'],
    )

@contextmanager
def find_node_by_class(doctree, node_class, remove):
    """Find the first node of the specified class."""
    index = doctree.first_child_matching_class(node_class)
    if index is not None:
        yield doctree[index]
        if remove:
            del doctree[index]
    else:
        yield

def extract_title(doctree, remove=True):
    """Find, extract, optionally remove, and return the document's first
    heading (which is assumed to be the main title).
    """
    with find_node_by_class(doctree, nodes.title, remove) as node:
        if node is not None:
            return node.astext()

def extract_metadata(doctree, field_names_and_parsers, remove=True):
    """Find, extract, optionally remove, and return the values for the
    specified names from the document's first field list (which is assumed to
    represent the document's metadata).
    """
    # Iterating over a dictionary yields its keys (Python 2 and 3 alike).
    field_names = frozenset(field_names_and_parsers)
    metadata = dict.fromkeys(field_names)

    with find_node_by_class(doctree, nodes.field_list, remove) as node:
        if node is not None:
            field_nodes = select_field_nodes(node, field_names)
            # Parse each field's value using the function
            # specified for the field's name.
            for name, value in field_nodes:
                metadata[name] = field_names_and_parsers[name](value)

    return metadata

def select_field_nodes(subtree, names):
    """Yield a (name, value) pair for each field whose name is given."""
    field_nodes = (node for node in subtree if node.__class__ is nodes.field)
    for field_node in field_nodes:
        name = field_node[0].astext().lower()
        if name in names:
            value = field_node[1].astext()
            yield name, value


# tests
#

TEST_INPUT = """\
=======
Example
=======

:Id: 42
:Author: John Doe
:Date: 2012-06-13
:Version: 0.1
:Tags: crazy, plain stupid, crazy, unexpected
:SomethingElse: This should be ignored.

Once upon a time ...\
"""

def test_parse_document():
    """Example usage as well as unit test."""
    expected = DocumentParts(
        metadata={
            'id': 42,
            'author': 'John Doe',
            'date': date(2012, 6, 13),
            'version': '0.1',
            'tags': frozenset(['crazy', 'plain stupid', 'unexpected']),
        },
        title='Example',
        body=(
            '<div class="document" id="example">\n'
            '<p>Once upon a time ...</p>\n</div>\n'
        ),
    )

    # Define field names to watch out for as well as
    # functions to parse their values.
    field_names_and_parsers = {
        'id': int,
        'author': str,
        'date': lambda s: datetime.strptime(s, '%Y-%m-%d').date(),
        'version': str,
        'tags': lambda s: frozenset(tag.strip() for tag in s.split(',')),
    }

    actual = parse_document(TEST_INPUT, field_names_and_parsers)

    # Compare actual to expected values.
    for attr_name in 'metadata', 'title', 'body':
        assert_helper(actual, expected, attr_name)

def assert_helper(actual_obj, expected_obj, attr_name):
    actual = getattr(actual_obj, attr_name)
    expected = getattr(expected_obj, attr_name)
    assert actual == expected, \
        'Value of attribute "%s" must be %r but is %r.' \
            % (attr_name, expected, actual)

#
# /tests

if __name__ == '__main__':
    print('Running tests ...')
    test_parse_document()
    print('alright!')
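
The field parsers are plain callables that turn a string into a value, so they can be tried without docutils. Below is a minimal sketch using only the standard library. The names parse_date, parse_tags, and parse_fields are made up for this illustration, and parse_fields only mimics the selection logic of extract_metadata on an ordinary dictionary of raw strings.

```python
from datetime import date, datetime


def parse_date(value):
    """Parse an ISO-style date such as '2012-06-13'."""
    return datetime.strptime(value, '%Y-%m-%d').date()


def parse_tags(value):
    """Split a comma-separated list into a set of stripped tags."""
    return frozenset(tag.strip() for tag in value.split(','))


def parse_fields(raw_fields, parsers):
    """Parse the specified fields; unspecified fields are discarded."""
    # Every specified name is present in the result, defaulting to None.
    metadata = dict.fromkeys(parsers)
    for name, value in raw_fields.items():
        if name in parsers:
            metadata[name] = parsers[name](value)
    return metadata


raw_fields = {
    'id': '42',
    'date': '2012-06-13',
    'tags': 'crazy, plain stupid, crazy, unexpected',
    'somethingelse': 'This should be ignored.',
}
parsers = {'id': int, 'date': parse_date, 'tags': parse_tags, 'author': str}

metadata = parse_fields(raw_fields, parsers)
print(metadata)
```

A field that is specified but missing from the document (author in this sketch) ends up as None, which mirrors the dict.fromkeys call in extract_metadata.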