Pandoc: Feature request: use variables in document body text

Created on 17 Feb 2015 · 23 Comments · Source: jgm/pandoc

variables.yaml:

protagonist:
- first: Ishmael
antagonist:
- first: Moby-Dick
- classification: whale
- colour: white
- possessive: "#{antagonist.first}'s"

---

(Only back references allowed for one-pass parsing.)

chapters/1.md:

"Call me #{protagonist.first}. I won't rest until I've mounted #{antagonist.possessive} fluke on my roof. That giant #{antagonist.color} fish of the sea is my nemesis."

Then pandoc variables.yaml chapters/1.md would write the following HTML to stdout, with the variables from the markdown file substituted using the values from the YAML file:

<html>
<body>
<p>
"Call me Ishmael. I won't rest until I've mounted Moby-Dick's fluke on my roof. That giant #{antagonist.color} fish of the sea is my nemesis."
</p>
</body>
</html>

Since color couldn't be found (due to the variable name being _colour_), no variable substitution is made. To stderr, a listing of all missing variables:

#{antagonist.color}

If this is already possible with pandoc, please link to the documentation showing a clear example for how to accomplish this task (without using templates, as they are inappropriate for this situation).

Ideas on how to write a preprocessor for markdown documents (that could then be piped to pandoc) are also quite welcome.
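A preprocessor along these lines could be sketched in Python. This is a minimal illustration, not an existing pandoc feature: the `VARIABLES` mapping stands in for an already-parsed variables.yaml (no YAML library is used, and the list-of-maps layout is simplified to plain nested maps), and the names `lookup` and `substitute` are hypothetical. References that cannot be resolved are left in place and collected so they can be listed on stderr.

```python
import re

# Hypothetical stand-in for the parsed variables.yaml.
VARIABLES = {
    "protagonist": {"first": "Ishmael"},
    "antagonist": {
        "first": "Moby-Dick",
        "classification": "whale",
        "colour": "white",
        "possessive": "Moby-Dick's",  # assumed to be interpolated already
    },
}

# Matches references such as #{protagonist.first}.
REFERENCE = re.compile(r"#\{([A-Za-z_][\w.]*)\}")


def lookup(path, data):
    """Return the value at a dotted path, or None if any segment is missing."""
    node = data
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def substitute(text, data):
    """Replace #{a.b} references; keep unknown ones and report them."""
    missing = []

    def replace(match):
        value = lookup(match.group(1), data)
        if value is None:
            missing.append(match.group(0))
            return match.group(0)  # leave the reference untouched
        return str(value)

    return REFERENCE.sub(replace, text), missing
```

The substituted text could then be piped to pandoc, with the `missing` list written to stderr.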

enhancement Markdown reader

DaveJarvis

👍14

Most helpful comment

+++ Dave Jarvis [Feb 16 15 19:20 ]:

If this is already possible with pandoc, please link to the documentation showing a clear example for how to accomplish this task (without using templates, as they are inappropriate for this situation).

Template expansion occurs only in templates, not in body text.

However, nothing stops you from using a Markdown file as a template
for itself. Take this my.md:

---
hello:
  english: world
  german: Welt
...

Hello $hello.english$.

Now do

% pandoc my.md --template my.md | pandoc -t html
<p>Hello world.</p>

This is a bit roundabout, admittedly. But it works.

jgm on 17 Feb 2015

👍11

All 23 comments

Thank you jgm: that's an interesting workaround and a good idea, given the constraints. Iterating over multiple chapters makes the problem a bit more difficult. A small shell script that first combines the variables with each chapter is useful:

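#!/bin/bash
# Prepend the shared variables to each chapter, then use the combined
# file as both input and template so pandoc expands $...$ in the body.
OUTDIR=output
rm -f "$OUTDIR"/*
mkdir -p "$OUTDIR"

for i in chapter/*.md; do
  out="$OUTDIR/$(basename "$i")"
  cat variables.yaml "$i" > "$out"
  pandoc "$out" --template "$out" | \
    pandoc -t context > "$OUTDIR/$(basename "$i" .md).tex"
done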

This way the variables can be saved in a single file, without having to reference the file in every chapter. That said, the following would be a simpler, cleaner, and much more robust solution:

pandoc --variables variables.yaml chapter/1.md -t context -o chapter/1.tex

Piping the combined variables and chapters directly to pandoc won't work because the --template option cannot read from standard input.

DaveJarvis on 17 Feb 2015

I like the power of pandoc!
/cc @jgm Would it be helpful to print an error/warning if a variable's value cannot be found?

/cc @DaveJarvis
I took your example as an exercise - let me know whether it'll work :)

Expand the variables file first (if it references its own variables):

pandoc variables.yaml --template variables.yaml > var-exp.yaml

Can we use _xargs_ instead of a script?

ls chap* | xargs -I file pandoc --template file var-exp.yaml file | pandoc -t context

nkalvi on 17 Feb 2015

+++ nkalvi [Feb 17 15 06:53 ]:

I like the power of pandoc!
/cc @jgm Would it be helpful to print an error/warning if a variable's value cannot be found?

No, because in lots of templates we test for a variable being set with an "if". Printing warnings would generate lots of spurious warnings.

jgm on 17 Feb 2015

/cc @jgm That's what I assumed was the reason it wasn't done. Thanks.

nkalvi on 17 Feb 2015

No, because in lots of templates we test for a variable being set with an "if". Printing warnings would generate lots of spurious warnings.

Warnings could be filtered by category. For example, with a hypothetical option:

pandoc --stderr=variables,conversion,formatting ...

If only variable-related errors are desired, then:

pandoc --stderr=variables ...

That said, why is testing for a variable being set repeated throughout the code? Shouldn't all the code rely on a single function so that variable tests are performed in one spot?

What would it take to keep track of referenced variables that could not be found, then list those (and the context) that couldn't be dereferenced? For example:

warning variables.yaml: $antagonist.color$ not found

For variables from standard input:

warning stdin: $antagonist.color$ not found
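A check of this kind could also run outside pandoc, before conversion. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the variables are assumed to be available as a nested Python mapping, and the line number added to each message is an assumption about what "context" might mean. It looks only at plain `$name$` references, does not evaluate `$if(...)$` conditionals, and would also flag inline math that happens to look like a variable name.

```python
import re

# Plain references such as $antagonist.color$; $if(x)$ and $for(x)$ do not match.
REFERENCE = re.compile(r"\$([A-Za-z][\w.]*)\$")

# Template keywords that look like plain references and must be skipped.
KEYWORDS = {"else", "endif", "endfor", "sep"}


def is_defined(path, data):
    """Return True if every segment of the dotted path exists in data."""
    node = data
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True


def missing_variable_warnings(text, data, source="stdin"):
    """List one warning per unresolved reference, with its line number."""
    warnings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in REFERENCE.finditer(line):
            path = match.group(1)
            if path in KEYWORDS or is_defined(path, data):
                continue
            warnings.append(
                f"warning {source}:{line_number}: ${path}$ not found"
            )
    return warnings
```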

DaveJarvis on 18 Feb 2015

+++ Dave Jarvis [Feb 17 15 17:52 ]:

No, because in lots of templates we test for a variable being set with an "if". Printing warnings would generate lots of spurious warnings.

That said, why is testing for a variable being set repeated throughout the code?

Not throughout the code. This is all handled in the Templates module.
My point was that many templates have variables that may or may not be
set, and this is a useful feature. So the suggested warning would
trigger many spurious, non-useful warnings.

jgm on 18 Feb 2015

On 2015-02-17 04:20, Dave Jarvis wrote:

If this is already possible with pandoc, please link to the documentation showing a clear example for how to accomplish this task (without using templates, as they are inappropriate for this situation).

Ideas on how to write a preprocessor for markdown documents (that could then be piped to pandoc) are also quite welcome.

I use [Template::Toolkit][] to do this among other things,
including reading variables from a YAML file, having written my
own commandline wrapper script -- which I'll share if you are
interested -- which can either read in one set of variables and
apply them to several templates/documents or read in several sets
of variables and apply them in turn to the same document template.
Unfortunately the commandline wrapper which comes with TT can't
read variables from files, and the only other publicly available
wrapper which can has some issues with the current version of TT.
You can use any tag delimiters you want with TT on a per document
basis, even regular expressions, but if the tag delimiters are
e.g. {% and %} TT sees all instances of those characters, or
all matches against the regular expression, as tag delimiters, so
you can't use something which clashes with regular Pandoc syntax
like {# or } (It would be an _extremely_ bad idea to use a
single curly bracket as tag delimiter!) but e.g.
#{protagonist.first}# which is close to your preferred syntax
would work.

I usually use double backticks around curly brackets as tag delimiters

 ``{protagonist.first}``

because the template tags will then stand out as 'code' if I
render the doc with pandoc without running it through TT (for
proofing), and if I actually need a multi-backtick code span which
begins/ends with braces I just put a space between the backticks
and the bracket:

 `` { } ``

Pandoc will see a code span beginning and ending with curly
brackets in both cases, but TT won't see the latter as tag delimiters.
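The effect of this delimiter choice can be illustrated with a small Python sketch. It does not use Template::Toolkit and is not how TT is implemented; it only mimics the matching rule described above with a regular expression. The `expand_tags` name and the flat `values` mapping with dotted keys are assumptions made for the example.

```python
import re

# A tag is two backticks immediately followed by "{", a name without
# whitespace, then "}" immediately followed by two backticks.
TAG = re.compile(r"``\{([^{}`\s]+)\}``")


def expand_tags(text, values):
    """Replace ``{name}`` tags; unknown names are left as written."""
    return TAG.sub(lambda m: str(values.get(m.group(1), m.group(0))), text)
```

A code span written with a space between the backticks and the brackets does not match the pattern, so it passes through unchanged.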

bpj on 20 Feb 2015

nkalvi:

ls chapter/* | xargs -I file pandoc --template file variables.yaml file | pandoc -t context

Good idea, but it doesn't quite reproduce the same output as the script. Also, running the variables through itself is a nice way to help resolve references.

bpj:

which I'll share if you are interested

I appreciate the offer and will let you know if the scripts start to become a time-waster. The only part that remains unsolved is the ability to know when a missing/non-existent tag is used. If there were a feature that prevented pandoc from substituting empty strings for undefined variables, then it'd be easy to grep the output for variables that were not dereferenced.

DaveJarvis on 21 Feb 2015

I've written a Java application that resolves these issues and more.

https://bitbucket.org/djarvis/yamlp

DaveJarvis on 23 Sep 2016

@jgm Apologies for digging up your 2 year old comment, but I liked this solution you suggested:

However, nothing stops you from using a Markdown file as a template for itself.

Yet I'm finding that having inline math prevents me from using a Markdown file as a template for itself. Adapting your example, take this my.md:

---
hello:
  english: world
  german: Welt
...

Hello $hello.english$. Did you know $1+1=2$?

Now do:

$ pandoc --template my.md my.md | pandoc -t markdown
pandoc: "template" (line 7, column 38):
unexpected "1"
expecting letter
CallStack (from HasCallStack):
  error, called at src/Text/Pandoc/Templates.hs:73:35 in pandoc-1.19.2.1-J1nmFBg9ln971v0RrPbKLJ:Text.Pandoc.Templates

I suspect I should handle this by using a template processor like Mustache or Liquid to preprocess the markdown, instead of the workaround that uses the markdown file as a template. But I thought I'd see if you had an alternative suggestion/workaround first 😄

michaelstepner on 13 May 2017

Define the calculation in YAML. For example:

  game:
    played:
      first: $date.protagonist.born$ - 672

Then reference the YAML variable within the document.

DaveJarvis on 13 May 2017

Define the calculation in YAML.

@DaveJarvis, my goal is to typeset an equation in LaTeX/MathJax, not perform a calculation. But your suggestion was a good idea.

michaelstepner on 14 May 2017

I'm reopening this as a feature request. Note that multimarkdown supports this under the name Metadata “Variables”. For example:

---
my name: John Doe
---

Best regards, [%my name]

Yes, weirdly you can put a space in there (and no, there is no way to access nested values).

Something like this could be easily implemented in the markdown reader, or just as a pandoc filter. Thoughts @jgm?

mb21 on 7 Oct 2018

👍4

@mb21: The pandoc-mustache filter that I've written satisfied my desire for this feature. (Although it may not satisfy everyone's needs!) Here's an example, pasted from the README for pandoc-mustache:

Example

This document, in document.md:

---
mustache: ./le_gaps.yaml
---
The richest American men live {{diff_le_richpoor_men}} years longer than the poorest men,
while the richest American women live {{diff_le_richpoor_women}} years longer than the poorest women.

Combined with these variable definitions, in le_gaps.yaml:

diff_le_richpoor_men: "14.6"
diff_le_richpoor_women: "10.1"

Will be converted by pandoc document.md --filter pandoc-mustache to:

The richest American men live 14.6 years longer than the poorest men, while the richest American women live 10.1 years longer than the poorest women.

michaelstepner on 7 Oct 2018

👍4

(Although it may not satisfy everyone's needs!)

There are a few key aspects that would make this feature more versatile:

Filename. Provide the name of the file containing variables on the command line. Such as:
- pandoc document.md --filter pandoc-mustache variables.yaml
Delimiters. Ability to define the start and end token delimiters, as hard-coding is an unnecessary restriction. See:
- {{...}} - Assemble, Handlebars, and others. Seems to be gaining popularity.
- #{...} - Aaron Parecki
- $(...) - Jekyll and Julia?
- ${...} - Scrivenvar, explicit bash variables, and Apache Camel REST services configuration
- `r#x(v$...)` - Scrivenvar's R Markdown (.Rmd) expressions (see knitr)
- [%...] - Multimarkdown?
- $...$ - pandoc variables
- +r ...+ - Knitr/AsciiDoc
- Implies:
  - pandoc document.md --filter pandoc-mustache -f variables.yaml -delim '${' -delim '}'
String interpolation. This YAML preprocessor first performs recursive string interpolation before attempting to substitute back into the document. The algorithm is a trivial 8 lines of code, once the data structures are defined.

See: https://github.com/michaelstepner/pandoc-mustache/issues/5

DaveJarvis on 7 Oct 2018

@DaveJarvis The pandoc-mustache filter is certainly quite barebones (but also quite useful to me). Anyone interested in improving it should check out the Contributing section of the README.

Further discussion of pandoc-mustache feature requests should probably be posted to the pandoc-mustache repo rather than this issue.

michaelstepner on 7 Oct 2018

👍1

There's actually a sample lua filter in the docs for doing just this:
https://pandoc.org/lua-filters.html#replacing-placeholders-with-their-metadata-value
It could be modified to use the [%my name] syntax.

Note that this would not expand variables in the same
way as pandoc templates (which allow things like author.last_name)
and would not include the control structures of pandoc-templates.

jgm on 7 Oct 2018

👍3

There's actually a sample lua filter in the docs for doing just this:

It's pretty close and an excellent example, but has practical shortcomings, some easier to resolve than others:

Escaped dollar symbols. Having to escape the $ signs is not directly compatible with pandoc's existing ability to parse YAML variables by piping pandoc through pandoc.
Interpolation. It seems this is an arduous feature to implement and there are a number of edge cases.
Namespaces. No dot-notation for organizing variables is supported.

Using lua makes calling pandoc simpler. For example, compare the following invocations:

cat *.md > body.md
pandoc body.md --lua-filter=variables.lua \
  --metadata-file=interpolated.yaml -t context > body.tex

# ...versus the equivalent....
cat interpolated.yaml > body.md
cat *.md >> body.md

pandoc body.md --template body.md --metadata pagetitle="unused" | \
    pandoc -t context > body.tex

Such simplifications using lua would make complex format conversions faster and easier to maintain (fewer lines of code).

Namespaces are quite helpful for organizing data in a meaningful way. Consider:

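# Flat names: the ice_ prefix is repeated on every key.
ice_make: "Lexus"
ice_model: "LS 430"
ice_year: "1991"

# Namespaced: related values are grouped under one key.
ice:
  make: "Lexus"
  model: "LS 430"
  year: 1991
ev:
  make: "Ford"
  model: "Focus Electric"
  year: 2019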

The lua filter assumes a flat hierarchy of variable names (e.g., ice_make), which is understandable; however, the ice_ prefix is duplication that is best avoided to ease maintainability.
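One way to keep the namespaced YAML while still feeding a tool that expects flat names is to flatten the hierarchy mechanically. The following Python sketch is an illustration only: `flatten` is a hypothetical helper, the data is assumed to be parsed into nested dictionaries already, and whether a given filter accepts the resulting key names is not something this thread confirms.

```python
def flatten(data, prefix="", separator="."):
    """Collapse nested mappings into a single level of joined key names."""
    flat = {}
    for key, value in data.items():
        name = f"{prefix}{separator}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse, carrying the joined name down as the new prefix.
            flat.update(flatten(value, name, separator))
        else:
            flat[name] = value
    return flat
```

With `separator="_"` the nested `ice` mapping yields the `ice_make`, `ice_model`, and `ice_year` names from the flat example, so the prefix is generated rather than maintained by hand.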

DaveJarvis on 1 Aug 2019

Maybe we could adjust the example lua filter jgm mentioned above, and make it a somewhat more official solution? Or do you think it's worth doing this in the markdown reader?

I agree with @DaveJarvis:

change the syntax to something else than dollars (as they're taken by math already). I'm fine with multimarkdown's [%author] (not sure that spaces should be allowed though).
allow dot notation like [%author.last_name]

P.S. Not sure what @DaveJarvis meant with "Interpolation".
P.P.S. I don't think we'd need the control structures (if, for, etc.) of pandoc-templates.

mb21 on 23 Aug 2019

P.S. Not sure what @DaveJarvis meant with "Interpolation".

See: https://en.wikipedia.org/wiki/String_interpolation

manufacturer:
  ford:
    name: Ford
ev:
  full: $ev.year$ $ev.make$ $ev.model$ 
  model: Focus Electric
  make: $manufacturer.ford.name$
  year: 2019

The value $ev.full$ resolves to 2019 Ford Focus Electric.

change the syntax to something else than dollars (as they're taken by math already). I'm fine with multimarkdown's [%author] (not sure that spaces should be allowed though).

Preferably it would work with any sigil or start/end token delimiters provided by the user. My yamlp provides this facility using a regular expression; Red Hat Fuse also allows customizing start and end tokens; Apache Camel might also have similar functionality. The point is that there's little reason to hard-code the sigils when more flexible approaches exist.

The overall algorithm becomes:

1. Load and parse a Markdown document with YAML header.
2. Pass the YAML header through the string interpolation preprocessor (lua or otherwise).
3. Replace the original YAML header with the preprocessed header.
4. Apply the resulting YAML hierarchy to the Markdown document.
5. Transform the AST as per usual.

Having an option to preprocess and export YAML files alone would also be useful. For example, an empty Markdown document having no body but a YAML header. Like the following example.md file:

---
manufacturer:
  ford:
    name: Ford
ev:
  full: $ev.year$ $ev.make$ $ev.model$ 
  model: Focus Electric
  make: $manufacturer.ford.name$
  year: 2019
---

Then something like:

pandoc --lua-filter=preprocess.lua --lua-args "start-token='$' stop-token='$'"  example.md

Produces (note the lack of quotation marks for numeric values):

---
manufacturer:
  ford:
    name: "Ford"
ev:
  full: "2019 Ford Focus Electric"
  model: "Focus Electric"
  make: "Ford"
  year: 2019
---

With a default maximum of 20 substitutions per key. Any keys having variable references that are nested deeper than the maximum will result in the last (e.g., 20th) key name being substituted without any corresponding value. This prevents infinite loops in interpolated references. The number 20 is arbitrary, but could be configurable. Similarly, any key that has no reference remains as its placeholder name, such as:

key1: value1
key2: $missing.key$

The value of key2 resolves to $missing.key$ .
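The interpolation behaviour described above can be sketched in Python. This is not the yamlp implementation or a pandoc feature; it is a minimal illustration that assumes the YAML header has already been parsed into nested dictionaries, and the names `lookup`, `expand_once`, and `interpolate` are hypothetical. Each string value is expanded repeatedly until it stops changing or the pass limit (20 by default) is reached; unknown references keep their placeholder.

```python
import re

REFERENCE = re.compile(r"\$([A-Za-z_][\w.]*)\$")
MAX_PASSES = 20  # arbitrary cap that stops circular references


def lookup(path, data):
    """Return the value at a dotted path, or None if it does not exist."""
    node = data
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def expand_once(value, root):
    """Replace each $a.b$ reference in a string with its current value."""
    def replace(match):
        found = lookup(match.group(1), root)
        # Unknown keys and whole sub-maps keep their placeholder.
        if found is None or isinstance(found, dict):
            return match.group(0)
        return str(found)
    return REFERENCE.sub(replace, value)


def interpolate(node, root=None, max_passes=MAX_PASSES):
    """Return a copy of the hierarchy with string values expanded."""
    root = node if root is None else root
    if isinstance(node, dict):
        return {k: interpolate(v, root, max_passes) for k, v in node.items()}
    if isinstance(node, str):
        for _ in range(max_passes):
            expanded = expand_once(node, root)
            if expanded == node:
                break
            node = expanded
    return node  # non-string scalars (e.g. 2019) are returned unchanged
```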

By processing the YAML header before pandoc parses the entire document, it prevents having to escape the dollar symbols (i.e., \$) or use a specific symbol set (e.g., [% and ]). My understanding is that pandoc -t context+tex_math_dollars allows the user to control whether $ symbols are interpreted as inline math expressions.

Being able to configure the variable path separator token (.) to use a user-specified value would offer greater flexibility. This would allow users to supply XPath-like references and other unconstrained possibilities, such as:

[%author/name/last]
${author.name.last}
$author>name>last$
{{author🠖name🠖last}}
`r v$author$name$last`
- Uses `r v$ to start, ` at end, and $ to separate (a contrived example based on R Markdown variables).
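User-supplied delimiters and separators like those listed above could be handled by building the matching pattern at run time. The Python sketch below is an illustration under stated assumptions: `make_substituter` is a hypothetical helper, the variables are assumed to be a nested dictionary, the sample value "Doe" is illustrative, and paths are assumed to contain no whitespace.

```python
import re


def make_substituter(start, end, separator="."):
    """Build a substitute(text, data) function for the given tokens."""
    # Escape the tokens so characters like $, [ or { are matched literally.
    pattern = re.compile(re.escape(start) + r"(\S+?)" + re.escape(end))

    def substitute(text, data):
        def replace(match):
            node = data
            for key in match.group(1).split(separator):
                if not isinstance(node, dict) or key not in node:
                    return match.group(0)  # unknown path: keep placeholder
                node = node[key]
            return str(node)
        return pattern.sub(replace, text)

    return substitute
```

For example, `make_substituter("[%", "]", "/")` handles `[%author/name/last]`, while `make_substituter("${", "}")` handles `${author.name.last}`.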

DaveJarvis on 24 Aug 2019

Thanks for the great write up @DaveJarvis! That technique served me pretty well for several projects.

I've since landed on one that didn't go very well, but I realized what I was trying to do was fundamentally different. I wasn't iterating over data so much as localizing content based on context. Hence I ended up with a frustrating mess of YAML 'data' that didn't quite make sense and it was unclear how to generate Markdown that had what I wanted.

In the end I realized that i18n tools were closer to what I needed, and I started pre-processing my content files with Handlebars. By default that was not much different than the YAML data substitution approach using Pandoc templates discussed above, but it allowed me to write a helper application to do something other than just substitute data from a table. What I ended up with was handlebars-helper-fluent, which wraps the Project Fluent i18n tools (specifically the JS toolkit) into a Handlebars helper. Now I can use both YAML data and FTL message files to provide content to inform my template. Once hbs-cli fills in all the blanks for me, using either its own substitutions for the string data or Fluent for localization (or data transformations that are functionally _similar_ to translation), the content gets passed to Pandoc.

Hopefully somebody else finds that helpful.

alerque on 12 Sep 2019

@DaveJarvis,

Sorry for entering this old topic, but using HTML inside an array does not work.

YAML:

---

gnome: 'gnome'

icones:
  - {nome: actions}
  - {nome: apps}
  - {nome: devices}
  - {nome: mimetypes}
  - {nome: places}
  - {nome: status}

mais:
  - {url: 'filename.com/$icones.nome$/logo=$gnome$'}

---

MD file:

$for(icones)$
  <img alt="$icones.nome$"   name="$icones.nome$"   src="https://$mais.url$"/>
$endfor$

It should look like:

<img alt="actions"   name="actions"   src="https://filename.com/actions/logo=gnome"/>
<img alt="apps"      name="apps"      src="https://filename.com/apps/logo=gnome"/>
<img alt="devices"   name="devices"   src="https://filename.com/devices/logo=gnome"/>
<img alt="mimetypes" name="mimetypes" src="https://filename.com/mimetypes/logo=gnome"/>
<img alt="places"    name="places"    src="https://filename.com/places/logo=gnome"/>
<img alt="status"    name="status"    src="https://filename.com/status/logo=gnome"/>
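One way to obtain lines of this shape is to build the URL per item outside the template, since the `$mais.url$` value is a fixed string rather than something expanded per loop iteration. The Python sketch below is only an illustration: `icon_tags` is a hypothetical helper, the host and URL pattern are copied from the example above, and the output is not column-aligned.

```python
from html import escape


def icon_tags(names, theme, host="filename.com"):
    """Return one <img> line per icon name, with the URL built per item."""
    lines = []
    for name in names:
        url = f"https://{host}/{name}/logo={theme}"
        lines.append(
            f'<img alt="{escape(name)}" name="{escape(name)}" src="{escape(url)}"/>'
        )
    return lines
```

The resulting lines could be written into the Markdown source before it is passed to pandoc.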