Elements Of a Great Markup Language

This post contains some inconclusive musing on lightweight markup languages (Markdown, AsciiDoc, LaTeX, reStructuredText, etc). The overall mood is that I don't think a genuinely great markup language exists. I wish it did though. As an appropriate disclosure, this text is written in AsciiDoctor.

EDIT: if you like this post, you should definitely check out https://djot.net.

EDIT: welp, that escalated quickly, this post is now written in Djot.

Document Model

This, I think, is the big one. Very often, a particular markup language is married to a particular output format, either syntactically (Markdown supports HTML syntax), or by the processor just not making a crisp enough distinction between the input document and the output (AsciiDoctor).

Roughly, if the markup language is for emitting HTML, or PDF, or DocBook XML, that's bad. A good markup language describes an abstract hierarchical structure of the document, and lets a separate program adapt that structure to the desired output.

More or less, what I want from markup is to convert a text string into a document tree:

use std::collections::HashMap;

enum Element {
  Text(String),
  Node {
    tag: String,
    attributes: HashMap<String, String>,
    children: Vec<Element>,
  }
}

fn parse_markup(input: &str) -> Element { ... }
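To make the shape of that tree concrete, here is a minimal sketch of the value a parser might produce for a two-item list. The tag names (`list`, `item`, `strong`) and the helper functions are hypothetical, chosen only for illustration, and a `BTreeMap` stands in for the map type so that attribute order is deterministic.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum Element {
    Text(String),
    Node {
        tag: String,
        attributes: BTreeMap<String, String>,
        children: Vec<Element>,
    },
}

/// Hypothetical helper: builds a text leaf.
fn text(s: &str) -> Element {
    Element::Text(s.to_string())
}

/// Hypothetical helper: builds an inner node from borrowed parts.
fn node(tag: &str, attributes: &[(&str, &str)], children: Vec<Element>) -> Element {
    Element::Node {
        tag: tag.to_string(),
        attributes: attributes
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
        children,
    }
}

/// Concatenates all text leaves, ignoring tags and attributes.
fn plain_text(element: &Element) -> String {
    match element {
        Element::Text(s) => s.clone(),
        Element::Node { children, .. } => children.iter().map(plain_text).collect(),
    }
}

/// The tree one might expect for a list with items "Foo" and a bold "Bar".
fn sample() -> Element {
    node(
        "list",
        &[],
        vec![
            node("item", &[], vec![text("Foo")]),
            node("item", &[], vec![node("strong", &[], vec![text("Bar")])]),
        ],
    )
}
```

Nothing in the tree says how a `list` should look on screen; that decision belongs to whatever consumes the tree.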

The markup language which nails this perfectly is HTML. It directly expresses this tree structure. Various viewers for HTML can then render the document in a particular fashion. HTML's syntax itself doesn't really care about tag names and semantics: you can imagine authoring HTML documents using an alternative set of tag names.

The markup language which completely falls over here is Markdown. There's no way to express generic tree structure, and conversion to HTML with specific browser tags is hard-coded.

The language which does this half-well is AsciiDoctor.

In AsciiDoctor, it is possible to express genuine nesting. Here's a bunch of nested blocks with some inline content and attributes:

====
Here are your options:

.Red Pill
[%collapsible]
======
Escape into the real world.
======

.Blue Pill
[%collapsible]
======
Live within the simulated reality without want or fear.
======

====

The problem with AsciiDoctor is that generic blocks come off as a bit of an implementation detail, not as a foundation. It is difficult to untangle presentation-specific semantics of particular blocks (examples, admonitions, etc) from the generic document structure. As a fun consequence, a semantic-neutral block (the equivalent of a <div>) is the only kind of block which can't actually nest in AsciiDoctor, due to syntactic ambiguity.

Concrete Syntax

Syntax matters. For lightweight text markup languages, syntax is of utmost importance.

The only right way to spell a list is

- Foo
- Bar
- Baz

Not

<ul>
    <li>Foo</li>
    <li>Bar</li>
    <li>Baz</li>
</ul>

And most definitely not

\begin{itemize}
    \item Foo
    \item Bar
    \item Baz
\end{itemize}

Similarly, you lose if you spell links like this:

`My Blog <https://matklad.github.io>`_

Markdown is the trailblazer here: it picked a lot of great concrete syntaxes. Some choices are questionable though, like the trailing double-space rule, or the syntax for including images.

AsciiDoctor is the treasure trove of tasteful syntactic decisions.

Inline Formatting

For example *bold* is bold, _italics_ is italics, and repeating the emphasis symbol twice (__like *this*__) allows for unambiguous nesting.

Lists

Another tasteful decision is numbered lists, which use . to avoid tedious renumbering:

[lowerroman]
. One
. Two
. Three

Rendered:

  1. One
  2. Two
  3. Three

Tables

And AsciiDoctor also has a reasonable-ish syntax for tables, with one line per cell and a blank line to delimit rows.

[cols="1,1"]
|===
|First
|Row

|X
|Y

|Last
|Row
|===

Rendered:

  First  Row
  X      Y
  Last   Row

Composable Processing

To convert our nice, sweet syntax to a general tree and then into the final output, we need some kind of a tool. One way to do that is by direct translation from our source document to, e.g., HTML.

Such one-step translation is convenient for all-inclusive tools, but is a barrier for extensibility. Amusingly, AsciiDoctor is both a positive and a negative example here.

On the negative side of things, classical AsciiDoctor is an extensible Ruby processor. To extend it, you essentially write a compiler plugin: a bit of Ruby code which gets hooked into the main processor and gets invoked as a callback when certain tags are parsed. This plugin interacts with the Ruby API of the processor itself, and is tied to a particular toolchain.

In contrast, asciidoctor-web-pdf, a newer thing (which nonetheless uses the same Ruby core), approaches the task a bit differently. There's no API to extend the processor itself. Rather, the processor produces an abstract document tree, and then a user-supplied JavaScript function can convert that piece of data into whatever HTML it needs, by following a lightweight visitor pattern. I think this is the key to a rich ecosystem: strictly separate converting input text to an abstract document model from rendering the model through some template. The two parts could be done by two separate processes which exchange serialized data. It's even possible to imagine some canonical JSON encoding of the parsed document.
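As an illustration of what such a JSON encoding could look like, here is a minimal hand-rolled serializer for the document tree. The encoding itself is an assumption made up for this sketch (text as a JSON string, nodes as objects with `tag`, `attributes`, and `children` keys); it is not a format used by AsciiDoctor or any other existing tool. A `BTreeMap` keeps attribute keys sorted, so the same tree always serializes to the same bytes.

```rust
use std::collections::BTreeMap;

enum Element {
    Text(String),
    Node {
        tag: String,
        attributes: BTreeMap<String, String>,
        children: Vec<Element>,
    },
}

/// Encodes a string as a JSON string literal, escaping quotes,
/// backslashes, and control characters.
fn json_string(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Serializes the tree: text becomes a JSON string, a node becomes an object.
fn to_json(element: &Element) -> String {
    match element {
        Element::Text(text) => json_string(text),
        Element::Node { tag, attributes, children } => {
            let attrs: Vec<String> = attributes
                .iter()
                .map(|(k, v)| format!("{}:{}", json_string(k), json_string(v)))
                .collect();
            let kids: Vec<String> = children.iter().map(to_json).collect();
            format!(
                "{{\"tag\":{},\"attributes\":{{{}}},\"children\":[{}]}}",
                json_string(tag),
                attrs.join(","),
                kids.join(",")
            )
        }
    }
}
```

With something like this on the parser's stdout, the rendering side can be written in any language that can read JSON.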

There's one more behavior where the all-inclusive approach of AsciiDoctor gets in the way of doing the right thing. AsciiDoctor supports includes, and they are textual, preprocessor-style includes, meaning that the syntax of the included file affects what follows afterwards. A much cleaner solution would have been to keep includes in the document tree as distinct nodes (with the path to the included file as an attribute), and leave it to the output layer to interpret those as either verbatim text or subdocuments.
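A rough sketch of the node-based alternative, under stated assumptions: the parser emits a node tagged `include` with a `path` attribute, and the output layer later decides what to do with it. The tag names `include` and `listing`, and the `read` callback, are hypothetical. This particular output layer chooses the "verbatim text" interpretation; another one could parse the file and splice in a subdocument instead.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum Element {
    Text(String),
    Node {
        tag: String,
        attributes: BTreeMap<String, String>,
        children: Vec<Element>,
    },
}

/// Replaces every `include` node with a `listing` node holding the file's
/// text verbatim. `read` is supplied by the output layer, so the parser
/// itself never touches the file system. Missing files become empty text.
fn expand_includes(element: Element, read: &dyn Fn(&str) -> Option<String>) -> Element {
    match element {
        Element::Node { tag, attributes, .. } if tag == "include" => {
            let body = attributes
                .get("path")
                .and_then(|p| read(p.as_str()))
                .unwrap_or_default();
            Element::Node {
                tag: "listing".to_string(),
                attributes,
                children: vec![Element::Text(body)],
            }
        }
        // Any other node is kept as-is, with its children processed recursively.
        Element::Node { tag, attributes, children } => Element::Node {
            tag,
            attributes,
            children: children
                .into_iter()
                .map(|c| expand_includes(c, read))
                .collect(),
        },
        text => text,
    }
}
```

Because the included file never passes through the parser as raw text, its syntax cannot leak into the surrounding document.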

Another aspect of composability is that the parsing part of the processing should have, at minimum, a lightweight, embeddable implementation. Ideally, of course, there's a spec and an array of implementations to choose from.

Markdown fares fairly well here: there never was a shortage of implementations, and today we even have a bunch of different specs!

AsciiDoctor? Well, I am amazed. The original implementation of AsciiDoc was in Python. AsciiDoctor, the current tool, is in Ruby. Neither is too embeddable. But! AsciiDoctor folks are crazy, they compiled Ruby to JavaScript (and Java), and so the toolchain is available on JVM and Node. At least for Node, I can confidently say that that's a real production-ready thing which is quite convenient to use! Still, I'd prefer a Rust library or a small WebAssembly blob instead.

A different aspect of composability is extensibility. In Markdown land, the usual answer for when Markdown doesn't quite do everything needed (i.e., in 90% of cases) is to extend the concrete syntax. This is quite unfortunate: changing syntax is hard. A much better avenue, I think, is to take advantage of the generic tree structure, and extend the output layer instead. Tree-with-attributes should be enough to express whatever structure is needed, and then it's up to the converter to pattern-match this structure and emit its special thing.
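Here is a small sketch of that kind of pattern-matching in a converter, reusing the collapsible Red Pill / Blue Pill blocks from earlier as the "special thing". It assumes the parser records `[%collapsible]` and the block title as `collapsible` and `title` attributes; those attribute names are made up for this example, and HTML escaping is left out for brevity.

```rust
use std::collections::BTreeMap;

enum Element {
    Text(String),
    Node {
        tag: String,
        attributes: BTreeMap<String, String>,
        children: Vec<Element>,
    },
}

/// Converts the tree to HTML. Generic blocks become <div>s, but a block
/// carrying both `collapsible` and `title` attributes is rendered as
/// <details> with a <summary>. No new markup syntax is involved.
fn to_html(element: &Element) -> String {
    match element {
        Element::Text(text) => text.clone(),
        Element::Node { attributes, children, .. } => {
            let inner: String = children.iter().map(to_html).collect();
            match (attributes.get("collapsible"), attributes.get("title")) {
                (Some(_), Some(title)) => format!(
                    "<details><summary>{}</summary>{}</details>",
                    title, inner
                ),
                _ => format!("<div>{}</div>", inner),
            }
        }
    }
}
```

Supporting another special block means adding another match arm in the converter, while the parser stays untouched.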

Do you remember the fancy two-column rendering above with source code on the left, and rendered document on the right? This is how I've done it:

[.two-col]
--
```
[lowerroman]
. One
. Two
. Three
```

[lowerroman]
. One
. Two
. Three
--

That is, a generic block, with a .two-col attribute and two children: a listing block and a list. Then there's a separate CSS file which assigns an appropriate flexbox layout for .two-col elements. There's no need for a special two-column layout extension. It would perhaps be nice to have a dedicated syntax here, but just re-using the generic -- block is quite ok!

Where Do We Stand Now?

Not quite there, I would think! AsciiDoctor at least half-ticks quite a few of the checkboxes, but it is still not perfect.

There is a specification in progress, and I have high hopes that it'll spur alternative implementations (and most of AsciiDoctor's problems are implementation issues). At the same time, I am not overly optimistic. The overriding goal for AsciiDoctor is compatibility, and rightfully so. There's a lot of content already written, and I would hate to migrate this blog, for example :)

At the same time, there are quite a few rough edges in AsciiDoctor:

  • includes
  • non-nestable generic blocks
  • many ways to do certain things (AsciiDoctor essentially supports the union of Markdown and AsciiDoc concrete syntaxes)
  • lack of some concrete sugar (reference-style links are notably better in Markdown)

It feels like there's a smaller, simpler language somewhere (no, I will not link that xkcd for once (though xkcd:927[] would be a nice use of AsciiDoctor extensibility)).

On the positive side of things, it seems that in recent years we built a lot of infrastructure to make these kinds of projects more feasible.

Rust is just about the perfect language to take a String from a user and parse it into some sort of a tree, while packaging the whole thing into a self-contained, zero-dependency, highly embeddable, reliable, and reusable library.

WebAssembly greatly extends reusability of low-level libraries: between a static library with a C ABI, and a .wasm module, you've got all important platforms covered.

True extensibility fundamentally requires taking code as input data. A converter from a great markup language to HTML should accept some user-written script file as an argument, to do fine tweaking of the conversion process. WebAssembly can be a part of the solution: it is a toolchain-neutral way of expressing computation. But we have something even more appropriate. Deno, with its friendly scripting language, nice template literals, and a capability-based security model, is just about the perfect runtime to implement a static site generator which takes a bunch of input documents and a custom conversion script, and outputs a bunch of HTML files.

If I didn't have anything else to do, I'd certainly be writing my own lightweight markup language today!