Pandoc: Generalized syntax for raw blocks in Markdown?

Created on 28 Mar 2017 · 48 comments · Source: jgm/pandoc

Pandoc's Markdown allows you to insert raw HTML and LaTeX blocks, recognizing these automatically. But it might be nice to include a way to insert arbitrary raw blocks in arbitrary formats.

The simplest approach would be to overload the fenced code block syntax.

``` {raw="ms"}
.MYMACRO
blah blah
```

It would make sense to add an extension for this behavior, perhaps raw_literal_blocks.

See this thread for background.
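As an illustration of the proposal above, here is a minimal sketch of how a reader might decide whether an opening fence marks a raw block or an ordinary code block. The helper name `classify_fence` and the regular expression are hypothetical and are not taken from pandoc's source; real attribute parsing is more involved.

```python
import re

# Hypothetical pattern: matches raw="FORMAT" or raw=FORMAT inside the braces.
RAW_KEY = re.compile(r'\braw\s*=\s*"?([A-Za-z0-9_+-]+)"?')

def classify_fence(info):
    """Return ("raw", fmt) or ("code", info) for a fence info string."""
    info = info.strip()
    if info.startswith("{") and info.endswith("}"):
        match = RAW_KEY.search(info[1:-1])
        if match:
            return ("raw", match.group(1))
    # Anything else keeps its current meaning: a plain code block.
    return ("code", info)

print(classify_fence('{raw="ms"}'))
print(classify_fence("{.html}"))
```

Because the raw marker is an ordinary key-value attribute here, older readers that do not know the extension would still see a code block.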

Markdown reader

All 48 comments

Why not also allow overloading of inline code syntax?

See the doctext for [my filter] for possible use cases.

+++ Benct Philip Jonsson [Mar 29 17 07:30 ]:

> Why not also allow overloading of inline code syntax?

Indeed, if we do blocks we should do inlines too.

Maybe the declaration in the code block example provided by @jgm could be simplified from raw="ms" to just ms:

``` {ms}
.MYMACRO
blah blah
```

And similar for inline code to just:

`ms .MYMACRO ...`

The approach demonstrated in the examples above is what R Markdown uses, and it makes declarations less verbose. As I understand it, Pandoc currently parses a key without a value as the block's style (class) when it is prefixed with a dot. Perhaps the first occurrence of an undotted key could be used to cover this case of an overloaded code block.

I see the advantages to the short form, but against it:

  • it requires a revision to current attribute syntax
  • the distinction between {html} (raw) and {.html} (code)
    becomes a bit too subtle (one period).

I think I prefer the more explicit raw=ms version.

  • in addition to being short, the ms only version makes the meaning obvious from the syntax as it would not treat one of the attributes as special
  • it would also be possible to use both at the same time e.g. to use the attribute as a version of the raw block or making it accessible for special treatment in (lua) filters at the same time
  • when the document is interspersed with many code blocks (e.g. literate programming) it adds a lot less visual noise, making it easier to focus on the meaning
  • is markdown in various flavors not all about subtlety which makes the usage keyboard friendly: atx headers vs hashes, 'at' signs vs numbering, numbered lists etc.
  • also, since the dotted/undotted distinction is used in CSS style definitions to tell an HTML element from a class name, I believe general familiarity with that would already make the dot prefix stand out
  • approach would be consistent with existing use cases and actual use, hence it would follow the principle of least surprise

These are all good points. I was thinking that one drawback
of the {ms} syntax is that it would break with earlier
versions of pandoc, rather than falling back gracefully to
a code block. But then I thought: maybe breaking is what
you want in this case. And then I tried it and found that
pandoc actually deals with {ms} by treating it as a
language name! (Remember that fenced code blocks can take
either a simple language name or pandoc attribute specifier
in braces.) This is unintended, though, and would be easy
to fix.

I still worry, though, about the close visual similarity
between {ms} and {.ms}, which would, on this proposal,
have very different meanings. But perhaps this is
outweighed by the considerations you give.

> * in addition to being short, the ms only version makes the meaning
>   obvious from the syntax as it would not treat one of the
>   attributes as special
> * it would also be possible to use both at the same time e.g. to use
>   the attribute as a version of the raw block or making it accessible
>   for special treatment in (lua) filters at the same time
> * when the document is interspersed with many code blocks (e.g.
>   literate programming) it adds a lot less visual noise, making it
>   easier to focus on the meaning
> * is markdown in various flavors not all about subtlety which makes
>   the usage keyboard friendly: atx headers vs hashes, 'at' signs vs
>   numbering, numbered lists etc.
> * approach would be consistent with existing use cases and actual
>   use, hence it would follow the principle of least surprise

Another argument in favor of the {html} syntax is internationalization: markdown has very few English keywords and is hence equally well suited for text written in other languages. This seems like a property worth preserving.

Everyone of course wants _their_ extension (i.e. their key, raw= in this case) to be the one requiring no key in the attribute list. Always on the grounds that it would look cleaner and already be internationalised. Why should exactly this extension get that favourite treatment, and not say the lang attribute?

It's one thing to reserve special characters for some keys (as we've already done with # and . for id= and class=) which I think we can talk about, quite another to leave the prefix away entirely for one special key.
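To make the existing shorthand concrete, here is a minimal sketch of how an attribute string with the `#` and `.` prefixes breaks down into an identifier, classes, and key-value pairs. The function `parse_attributes` is a hypothetical helper written for this illustration, not pandoc's actual parser, and it deliberately rejects a bare token such as `ms`, which is the case under debate.

```python
import shlex

def parse_attributes(attr):
    """Split '{#id .class key=value}' into (identifier, classes, pairs)."""
    body = attr.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("attribute string must be wrapped in braces")
    identifier, classes, pairs = "", [], []
    # shlex.split handles quoted values such as raw="ms".
    for token in shlex.split(body[1:-1]):
        if token.startswith("#"):
            identifier = token[1:]          # shorthand for id=...
        elif token.startswith("."):
            classes.append(token[1:])       # shorthand for class=...
        elif "=" in token:
            key, value = token.split("=", 1)
            pairs.append((key, value))
        else:
            # A bare word has no defined meaning in this sketch.
            raise ValueError("unrecognized attribute token: " + token)
    return identifier, classes, pairs

print(parse_attributes('{#myID .class raw="ms"}'))
```

Giving a bare word a meaning would require a new branch in the last `else`, which is the revision to the attribute syntax that the thread is weighing.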

You do have a point, though I'd say the privileged syntax seems justified here: it does not represent just another attribute but marks a completely separate element type.

As for the alternative: I'd prefer an example filter shipped with pandoc over special treatment of the raw=html syntax. It seems less cluttering and would allow for easy i18n by adapting the filter.

+++ Albert Krewinkel [Mar 30 17 00:50 ]:

> You do have a point, though I'd say the privileged syntax seems
> justified here: it does not represent just another attribute but
> marks a completely separate element type.

I think this is an important consideration.
So is your point about avoiding English words in the syntax,
something we've been doing all along.

Another possibility would be to use another special
character, thus:

``` {.html key=value}
a code block
```

``` {!html key=value}
raw html
```

> Another possibility would be to use another special character

yes, that would make much more sense I think.

I agree with @tarleb on privileged syntax being justified here since we are discussing a separate element type. An argument that everyone wants their own extension does not however seem entirely fair here. This generalized syntax would be covering a significantly wider group of use cases and would be on a higher level than lang attribute. Special prefix attributes (such as proposed lang) would come on top of this generalized raw blocks syntax (e.g. {text .xml :de-DE} or {table .csv :de-DE}). We can probably find an appropriate analogy for this case regarding lang attribute (or any other of this type) in HTML.

Choosing the prefixed version {!html} would open the question of the upper/default/type level attribute. Prefixing means another level/axis of syntax and even though {!html} seems like a lesser evil compared to {raw='html'}, it feels wrong to not have a base type or base context indicator. The goal of this overloaded fenced code block syntax, as I understand the proposal of @jgm, is to create a typed nested context and it seems to me that the most appropriate way for this would be the version without a special character prefix (just {html}). In addition this would to a lesser extent also ease filtering and third party parsing.

{!html} seems like a very nice compromise to me, although curly braces syntax might mislead users into thinking that attributes are supported. RawBlock could be wrapped in a div, of course, but that seems complex. How about the braceless !html?

I agree that the {!html} version is probably an acceptable compromise, but why would we want to make one here? It seems that there are many benefits and very few downsides to the version without the special character.

In relation to braceless !html, if the goal is to follow the simple path and overload the fenced code block syntax and get related functionality (various versions, attributes etc.) including backward compatibility for free, then we have to cover all existing options and ambiguities:

Block version

``` {!ms}
.MYMACRO
blah blah
```

and inline version would not be a problem

`!ms .MYMACRO ...`

Braceless version would lead to ambiguity since this:

```!ms
.MYMACRO
blah blah
```

would currently be interpreted as this:

```{.!ms}
.MYMACRO
blah blah
```

We could set a priority for this case: try the raw reading first and fall back to the class (style) reading if it fails. Or we could change the behavior of the shortcut form of the fenced code block (e.g. requiring the dot for a style). After putting down the examples above I must say that I am even more in favor of the version without a special character prefix.
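The priority idea can be sketched as follows. This is only an illustration under the assumption that a braceless info string is a single word; `parse_bare_info` and the returned dictionary shape are hypothetical and do not reflect how pandoc is implemented.

```python
def parse_bare_info(info):
    """Interpret a braceless fence info string such as 'python' or '!ms'."""
    word = info.strip()
    # First try the raw reading: a leading "!" followed by a format name.
    if word.startswith("!") and len(word) > 1:
        return {"kind": "raw", "format": word[1:]}
    # Otherwise fall back to the existing reading: the word is a class.
    return {"kind": "code", "classes": [word] if word else []}

print(parse_bare_info("!ms"))
print(parse_bare_info("python"))
```

With this ordering, `!ms` can no longer be used as a class name in the bare form, which is the backward-compatibility cost mentioned above.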

+++ fmba [Mar 30 17 11:21 ]:

> In addition
> this would to a lesser extent also ease filtering and third party
> parsing.

Really, it doesn't matter at all for filtering what the
syntax is. In the AST it will be

RawBlock (Format "html") string

I can't see that it would make much difference for
third-party parsing, either.

As for what should be done with {foo}, one possibility
we might consider is making this a key-value attribute
with an implicit boolean value (sort of like HTML does).
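To illustrate why the surface syntax does not matter to filters, here is a minimal sketch that rewrites a code block carrying a `raw` key into a raw block at the AST level. It assumes the `{"t": ..., "c": ...}` JSON shapes that pandoc's JSON filter interface uses for `CodeBlock` and `RawBlock`, as far as I understand them; the function `code_block_to_raw` is a hypothetical helper, not part of pandoc.

```python
def code_block_to_raw(block):
    """Turn a CodeBlock carrying a raw=FORMAT attribute into a RawBlock."""
    if block.get("t") != "CodeBlock":
        return block
    # Assumed layout: [[identifier, classes, key-value pairs], text]
    (identifier, classes, pairs), text = block["c"]
    for key, value in pairs:
        if key == "raw":
            # Assumed layout of a RawBlock: [format, text]
            return {"t": "RawBlock", "c": [value, text]}
    return block

code = {"t": "CodeBlock", "c": [["", [], [["raw", "ms"]]], ".MYMACRO\nblah blah"]}
print(code_block_to_raw(code))
```

Whatever spelling the reader ends up accepting, a filter downstream only ever sees the resulting raw element and its format name.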

+++ Albert Krewinkel [Mar 30 17 11:40 ]:

> {!html} seems like a very nice compromise to me, although curly braces
> syntax might mislead users into thinking that attributes are supported.

Oh, yes -- I see the worry there.

> RawBlock could be wrapped in a div, of course, but that seems complex.
> How about the braceless !html?

Yes, that seems better.

With the braceless !html, what would inline look like?
Would we allow

`<a>`!html

> would currently be interpreted as this:
> {.!ms} .MYMACRO blah blah

I think we'd just impose a rule that "bare" language names
can't begin with !.

> although curly braces syntax might mislead users into thinking that attributes are supported.

True... I can also see the argument that this is its own element and not a "mere" attribute. So maybe it should be:

``` !html
foo
```

and

`foo`!html

Then again, it's another syntax for users to learn and sticking to the familiar curly braces might be simpler...

@jgm: By third party filtering/parsing I was referring to the fact that whichever syntax is chosen will become the de facto standard in many areas. Flavored markdown parsers that already support the syntax that we are discussing might have to be changed. The same goes for various hacks, dumb scripts etc. Not that it should influence the decision; I am only stating that if it happens to be consistent with existing options, it would mean one less headache for the users/writers.

Den 2017-03-30 kl. 21:16, skrev fmba:

> I agree that the {!html} version is probably an acceptable compromise, but why would we want to make one here?

Because one year from now some other way of overloading code block
syntax will come up, which seems clearly the obvious candidate for
unmarked {foo} to the person who wants that feature. There
already are filters which take codeblock contents, process them
with some external program like dot (graphviz), ditaa or a CSV
parser and insert an appropriate element, e.g. an image or a
table, in place of the code block.
I initially thought that {.dot} would be the way of flagging
such a code block -- until I wanted to include an example of dot
code in the same document! I solved it by adding another class
.code, but realized that I should support dot="OPTIONS" as
well.

FWIW !foo makes a Vim user think exactly "process this codeblock
with the foo program", so that -- or, I'm tempted to say, rather
"generate an image with the foo program and insert the markup
for embedding it where the code block used to be" -- is probably
what the bang prefix should be used for in the future. Not that it
necessarily clashes with the "raw markup" meaning, but perhaps we
should use another prefix for raw markup for that reason,
{@html} for example, to borrow yet another prefix with vaguely
similar meaning from CSS. N.B. That I'm all for the idea of
avoiding English keywords; it was I who once suggested that table
captions could be marked with : alone for just that reason.

/bpj

+++ fmba [Mar 30 17 12:45 ]:

> Flavored markdown parsers that already support the syntax
> that we are discussing might have to be changed.

Which flavors already support the {ms} form for raw
content? I hadn't seen that before.

> although curly braces syntax might mislead users into thinking that attributes are supported.

why wouldn't we want to support attributes here? don't underestimate the utility and flexibility of being able to pass key-value pairs to filters and writers as part of the attribute string...

can we have this "raw" attribute live alongside all the others?

```{#myID .class !html}
foo
```

> > although curly braces syntax might mislead users into thinking that attributes are supported.
>
> why wouldn't we want to support attributes here?

Because the RawBlock element doesn't take any attributes other than a language.

To me however braces don't signal "any attributes here" but "some attribute(s) here", and the language attribute of RawBlock and RawInline certainly is an attribute, albeit not an HTML-style attribute. I'd much rather keep the braces than running into ambiguities with `foo`!format which isn't followed by whitespace or punctuation.

It should be possible to have a letter right after a raw inline without anything intervening.
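The ambiguity can be shown with a small experiment. The two regular expressions below are hypothetical approximations of the braced and the bang-suffixed inline spellings, written only for this illustration; they are not how any Markdown reader actually tokenizes inline code.

```python
import re

# Hypothetical patterns for the two inline spellings under discussion.
BRACED = re.compile(r"`([^`]+)`\{=([A-Za-z0-9_]+)\}")
BANG = re.compile(r"`([^`]+)`!([A-Za-z0-9_]+)")

def raw_spans(pattern, text):
    """Return (content, format) pairs for every raw inline the pattern finds."""
    return pattern.findall(text)

# A letter follows the first raw inline with nothing in between.
braced = raw_spans(BRACED, "`<b>`{=html}old`</b>`{=html}")
bang = raw_spans(BANG, "`<b>`!htmlold`</b>`!html")

print(braced)  # [('<b>', 'html'), ('</b>', 'html')]
print(bang)    # [('<b>', 'htmlold'), ('</b>', 'html')]
```

With a closing delimiter the format name ends where the brace does; without one, the following word `old` is swallowed into the format name.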

> To me however braces don't signal "any attributes here" but "some attribute(s) here"

For me the opposite is true. If there are braces, then I assume that I can use an id, class, or arbitrary key=value there. Are there any other examples where braces are used, but not for a complete attribute string?

Perhaps, instead of changing the AST so that RawBlock can have attributes (this would be useful, but I'm not sure it's worth the cost), a different delimiter could be used?

Angled brackets?:

```<man>
foo
```

or inline `foo`<man>

> > why wouldn't we want to support attributes here?
>
> Because the RawBlock element doesn't take any attributes other than a
> language.

One might then ask, why not add attributes to this element?
But, leaving aside the general pain of AST changes, I see no
point in doing that. If you want a filter to act on
raw content with attributes, you can always use a code
block.

> To me however braces don't signal "any attributes here" but "some
> attribute(s) here", and the language attribute of RawBlock and
> RawInline certainly is an attribute, albeit not an HTML-style
> attribute. I'd much rather keep the braces than running into
> ambiguities with `foo`!format which isn't followed by whitespace or
> punctuation.
>
> It should be possible to have a letter right after a raw inline
> without anything intervening.

Yes, I agree, because we need this to work inline, braces
are probably a good idea. (Or something else, like square
brackets, but square brackets already have pretty
well-defined uses in inline contexts.)

Angled brackets already used for raw HTML.

> Angled brackets already used for raw HTML.

That prompted me to think about what I mean with "curlies and attributes belong together", and I realized that the different kinds of brackets already have rather well defined meanings:

  • Angle brackets: hypertext -- raw HTML or linked literal URL.

  • Square brackets: element with text and attributes -- link/image/span/citation.

  • Curly brackets: attributes -- and I thought it would be better to extend that to non-html attributes than to attach the attributes semantics to yet another delimiter, but

  • Parentheses: element specific attributes -- they arguably have this meaning with links and images. This in contradistinction to more general attributes in braces.

So the syntax we should use for raw content is rather something like this:

``` (*html)
<tag>
```

`<tag>`(*html)

@bpj well said. I like how the parenthesis markup looks next to the back ticks too. It's unique enough to stand out as something different (raw block) but would gracefully fail in other markdown parsers as just a parenthetical following a code block.

I think I like the parentheses idea, though I'm not sure why the * should be there.
In the rare case where you want parentheses right after inline code, you could always escape.

So we'd have

``` (ms)
.2C
```

and text with `\*[special "foo"]`(ms).

Note the connection with the syntax I proposed for raw metadata items in #2139:

```
header-includes:
  - (latex):  '\raw{latex}'
    (html):   '<raw>html</raw>'
    (_):      'raw fallback'
```

I don't have anything in particular against the proposed (ms) syntax... it's just that it's yet another syntax to learn. Also, the argument for having internationalisation built in is kind of weak in this case. `<iframe>`{format=html} already requires extensive usage of English keywords like html and iframe, so adding another one (format or raw) doesn't make much of a difference.

P.S. If we have a raw block and inline syntax (as discussed here), we wouldn't need the YAML objects with special keys for raw metadata items anymore (as discussed in #2139), right?
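For comparison, the per-format metadata idea from #2139 amounts to choosing one raw variant by output format. The sketch below assumes that the `(_)` key acts as a fallback, which is my reading of the `'raw fallback'` value in the example, and it models the keys without their parentheses; `pick_raw` is a hypothetical helper.

```python
def pick_raw(variants, output_format):
    """Choose the raw snippet for output_format, falling back to the '_' entry."""
    if output_format in variants:
        return (output_format, variants[output_format])
    if "_" in variants:
        return ("_", variants["_"])
    return None

# The header-includes item from the example, with parentheses dropped.
variants = {
    "latex": r"\raw{latex}",
    "html": "<raw>html</raw>",
    "_": "raw fallback",
}

print(pick_raw(variants, "html"))
print(pick_raw(variants, "ms"))
```

An inline or block raw syntax makes the same selection implicitly: each raw element names its format, and a writer simply drops the ones that do not apply.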

+++ Mauro Bieg [Apr 03 17 02:31 ]:

> I don't have anything in particular against the proposed (ms) syntax...
> it's just that it's yet another syntax to learn. Also, the argument for
> having internationalisation built in is kind of weak in this case.
> `<iframe>`{format=html} already requires extensive usage of English
> keywords like html and iframe, so adding another one (format or raw)
> doesn't make much of a difference.

That's a good point; raw content is likely to use English
words anyway, so maybe it's not so bad to use {raw=html}.

> P.S. If we have a raw block and inline syntax (as discussed here), we
> wouldn't need the YAML objects with special keys for raw metadata items
> anymore (as discussed in #2139), right?

Very good point!

> I think I like the parentheses idea, though I'm not sure why the * should be there.

The * signals "raw content" to leave the door open for using other punctuation characters after the opening parenthesis with other meanings. If @jgm doesn't mind I'll post some of my ideas for such syntaxes -- file includes and invoking e.g. dot to generate an image from the content of a code block -- to pandoc-discuss. (Surely nothing which couldn't be done with a filter, but somewhat painfully.)

> P.S. If we have a raw block and inline syntax (as discussed here), we wouldn't need the YAML objects with special keys for raw metadata items anymore (as discussed in #2139), right?

I use my filter mentioned earlier in this thread in header-includes items for example and it works well.

Thinking about it a bit more, I'm liking
`\f[I]`{=ms}

``` {=ms}
.MYMACRO
```

I am considering something similar for MultiMarkdown v6, and also arrived at the use of code blocks/spans as being a reasonable syntax.

Obviously, @jgm can do whatever he likes with pandoc, but I agree with @iandol that if there was a common syntax, that would be a plus.

My thoughts, at least as pertains to MMD:

  1. I'm less enamored with the {=foo} syntax. To me, it just "looks" ugly. Either (foo) or [foo] looks better to me. I realize this is subjective. Perhaps it's because I don't use the {#foo} attributes syntax for MMD. I assume that just using {foo} would be problematic?

  2. To raise the question: Why is the {=ms} inside the fenced code block, but outside the code span? I'm not necessarily opposed to it, but an interesting benefit of being consistent between the two is that a fenced code block falls back to a code span in variants that don't support them. It does seem worth considering that using the same approach in both cases could have useful side effects.

  3. As to the * syntax example above, I would anticipate using * as a wildcard format marker in MMD, just like I do in file transclusion. I'm not sure exactly why one would need to use the same raw output in multiple formats, but I'm sure someone will eventually need it, and I see no reason not to support it from the beginning.

Welcome to the discussion @fletcher! As you noticed, the curly-braces-proposals in this thread are mostly informed by the existing attribute syntax, which pandoc shares with PHP Markdown Extra, among others. see e.g. the commonmark discussion...

> I assume that just using {foo} would be problematic?

I think so, see above.

> Why is the {=ms} inside the fenced code block

That's where the attribute block for code blocks has been for a long time (in pandoc and PHP Markdown Extra). I guess historically this was inspired by the much more widespread:

```python
 x = 7
```

> That's where the attribute block for code blocks has been for a long time (in pandoc and PHP Markdown Extra). I guess historically this was inspired by the much more widespread:

Actually not. The PHP Markdown Extra developer and I hashed out a syntax for the fenced code blocks (originally with ~~~ delimiters) on the markdown-discuss mailing list before GitHub started using the ``` delimiters and allowing the bare language name.

> > Why is the {=ms} inside the fenced code block
>
> That's where the attribute block for code blocks has been for a long time (in pandoc and PHP Markdown Extra).

I'm not really asking why it's inside the code block. I'm asking why the two are handled differently, and whether that is really the best way to go? There may be some advantages to being consistent between the two forms.

> I'm not really asking why it's inside the code block. I'm asking why the two are handled differently, and whether that is really the best way to go? There may be some advantages to being consistent between the two forms.

I actually don't think of the attributes as being "inside the fenced code block." Rather, the attribute is part of the opening fence. Inside = the lines between the two fences. I think putting the attribute with the opening fence rather than the closing fence makes sense, as you can see right at the beginning what the syntax is. Anyway, I don't think there's any point in debating this. A wide variety of implementations has supported this location for fenced code block attributes for years now. Changing now would break with well established tradition and break a large number of existing documents.

Of course, you're right that we could have tried to make the inline case parallel:

`{.class}code`

instead of

`code`{.class}

I don't know if there's a compelling argument here, but to my eye, the first case is harder to read, because there's nothing separating the opening fence + attribute from the code itself. (Contrast the fenced block case, where you have a newline and it's completely obvious where the code starts.)

Now that we allow attributes for links, images, and generic spans, using the very same syntax, we also get a nice uniformity across those four cases (inline code, links, images, spans).

I agree that second case is more easily read than the first. Given pandoc's pre-existing support for attributes, the second instance also fits into that same model.

I also agree that in the fenced code block version, it makes more sense to have the information at the top than at the bottom.

Since MMD doesn't support attributes like this, I have a bit more flexibility in choosing a new syntax for this use case. Convergence would be nice, but is not necessary, I suppose.

Just FYI -- I pushed a commit to the development branch of MultiMarkdown 6 to add this feature, using the proposed {=html} syntax, in the hopes that it will be compatible with what is added to pandoc. I'll try it out and take feedback before considering it final.

Discussion: https://github.com/fletcher/MultiMarkdown-6/issues/38

As an unsophisticated pandoc user, I am glad about {=format}. I don't have to learn another kind of brace that occasionally follows inline code. And the = sign is not some HTML or CSS thingy, since I know a=b is common in that notation. So I don't need to consult a cheat-sheet if I just want to read the thing and skip over it. (I'll also note, as evidence of intuitiveness, that {=format} was the first solution I thought of before reading this thread.)

@fletcher: thank you for taking the interoperability factor into account! I also agree with @rose00 that the {=format} syntax has an intuitive appeal to it.

@iandol I'm always happy to consider interoperability, it's just that it's not the most important factor to me if I believe another project has "done it wrong." ;) In this instance there didn't seem to be an alternative that was significantly better, so the compatibility factor tipped the scale for me. That's assuming @jgm ends up sticking with this as a preference, of course...

OK, let's go with the {=ms} in pandoc too.
(Just recording this decision so I don't have to reread the whole thread again to start implementing it.)
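For readers who want to see the chosen spelling in concrete terms, here is a minimal sketch that recognizes `{=format}` on an opening fence and after an inline code span. The patterns and helper names are assumptions made for this illustration; they are simplified and are not taken from the pandoc or MultiMarkdown implementations.

```python
import re

# Hypothetical patterns: an opening fence followed only by {=format},
# and an inline code span immediately followed by {=format}.
FENCE = re.compile(r"^(`{3,}|~{3,})\s*\{=([A-Za-z0-9_]+)\}\s*$")
INLINE = re.compile(r"(`+)(.+?)\1\{=([A-Za-z0-9_]+)\}")

def raw_fence_format(line):
    """Return the raw format named on an opening fence, or None."""
    match = FENCE.match(line)
    return match.group(2) if match else None

def raw_inlines(text):
    """Return (format, content) pairs for raw inline spans in text."""
    return [(m.group(3), m.group(2)) for m in INLINE.finditer(text)]

print(raw_fence_format("``` {=ms}"))
print(raw_fence_format("``` {.ms}"))
print(raw_inlines(r"text `\f[I]`{=ms} more"))
```

A fence with a class (`{.ms}`) or a bare language name is left alone, so existing code blocks keep their meaning.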

Closed by 2b34337a9cf8b025914e8219498b4c0258772be0

Does this commit take care of the fact that in some cases the expected raw markup language attribute and the output format name don't coincide, e.g. the output format html5 (and several others) includes html raw markup elements? I had to introduce special code for that in my filter when I started to use HTML 5, and I guess quite a lot of users will be bitten by this unless html5 etc. as attributes do what they mean. Perhaps even xhtml should be covered.

+++ Benct Philip Jonsson [Jun 23 17 11:05 ]:

> Does this commit take care of the fact that in some cases the expected
> raw markup language attribute and the output format name don't
> coincide, e.g. the output format html5 (and several others) includes html
> raw markup elements? I had to introduce special code for that in my
> filter when I started to use HTML 5, and I guess quite a lot of users
> will be bitten by this unless html5 etc. as attributes do what they mean.
> Perhaps even xhtml should be covered.

As now documented in the manual, you need to use the
official names (matching what you'd use with --to option).

I've fixed the HTML writer so that it will work with either
html, html4, or html5. (html will work with either
html4 or html5 output; html4 and html5 are
specialized.)

> As now documented in the manual, you need to use the official names (matching what you'd use with --to option).

This is not the case, at least for docx (->openxml).
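The matching rules described in the last few comments can be summarized in a short sketch. The table below encodes only what the thread states: html raw content is accepted by both html4 and html5 output, html4 and html5 are specific to their own output, and docx output expects openxml rather than docx. The names `ACCEPTED_RAW` and `raw_applies` are hypothetical, and the default rule (format name equals output name) is an assumption for every format not listed.

```python
# Raw-format names accepted by each output format in this sketch.
ACCEPTED_RAW = {
    "html4": {"html", "html4"},
    "html5": {"html", "html5"},
    "docx": {"openxml"},
}

def raw_applies(raw_format, output_format):
    """Return True if raw content tagged raw_format should be emitted."""
    # Assumed default: a raw element applies when the names coincide.
    accepted = ACCEPTED_RAW.get(output_format, {output_format})
    return raw_format in accepted

print(raw_applies("html", "html5"))
print(raw_applies("html4", "html5"))
print(raw_applies("openxml", "docx"))
```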
