Pandoc: Feature request: Ability to specify a YAML metadata file for all reader types

Created on 19 Feb 2015 · 29 Comments · Source: jgm/pandoc

An idea I thought would be useful. Many of the readers have little or no way to set metadata, and the -M option on the command line only accepts strings, not arbitrary YAML. Right now you can include a separate file of YAML metadata for Markdown formats (which is simply concatenated with the Markdown files during parsing).

The idea would be to specify a metadata file on the command line ("pandoc -Y metafile.yml" or something), which would be parsed separately and whose contents would be added to the document metadata, regardless of the input file type.

Thanks

enhancement

jasonseeley

👍4
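
For illustration, the gap described in the request looks roughly like this. The sketch is not from the original post: the keys and values are made up, and -Y is only the placeholder spelling from the request, not an existing pandoc option, so it appears in a comment only.

```shell
#!/bin/sh
# Write an example metadata file with nested values.
# All keys and values below are illustrative assumptions.
# $1 = path of the file to create
write_metafile() {
  cat > "$1" <<'EOF'
title: Example document
author:
- name: Jane Doe
  affiliation: Example University
  email: jane@example.org
EOF
}

write_metafile metafile.yml

# What -M can express: one flat string per key.
#   pandoc -M title="Example document" -f rst -t html input.rst
# What the request asks for (hypothetical flag, shown for comparison only):
#   pandoc -Y metafile.yml -f rst -t html input.rst
```

The nested author entry is the kind of value that a single -M key=value pair cannot carry.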

All 29 comments

When this gets suggested, people generally recommend using Makefiles.

mpickering on 20 Feb 2015

One place where this would be useful is to allow the reading of YAML bibliography files. At the moment (unless I'm mistaken), a YAML references file (as you might insert in the document directly) is not an accepted input format for pandoc-citeproc. This makes it difficult to share a YAML bibliography file between multiple documents. If pandoc could bring in arbitrary YAML files, then both documents could bring in the same YAML bibliography file.

Note: it is desirable (to me) to use the YAML bibliography because it supports URL references better than other formats.

meowsqueak on 18 Mar 2015

If YAML bibliographies are not accepted by pandoc-citeproc,
they could certainly be added. Open a ticket on jgm/pandoc-citeproc,
once you've confirmed that they don't work (I can't recall whether
they do).

jgm on 19 Mar 2015

I double-checked and it seems that external YAML files are supported via both the --bibliography option and via inline YAML "bibliography: " when --filter pandoc-citeproc is specified, provided the YAML file is correctly formatted - there is no warning or error if it's not formatted correctly.

I raised this point because this is not documented as supported at http://johnmacfarlane.net/pandoc/README.html in the citations extension section (doc has no anchors).

So would this suggest I open both a documentation ticket against pandoc, and a ticket against pandoc-citeproc for the lack of warnings? Or is pandoc itself suppressing them?

meowsqueak on 22 Mar 2015
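
A minimal sketch of sharing one YAML bibliography between documents, based on the --bibliography usage reported above. The reference entry, the file names, and the helper function names are assumptions made for illustration. Since a badly formatted file reportedly produces no warning, the exact field layout is worth checking against the pandoc-citeproc documentation.

```shell
#!/bin/sh
# Write a small YAML bibliography. The entry is made up for illustration.
# $1 = path of the file to create
write_refs() {
  cat > "$1" <<'EOF'
---
references:
- id: example2015
  type: webpage
  title: An example web page
  author:
  - family: Doe
    given: Jane
  URL: https://example.org/page
...
EOF
}

# Convert one document against the shared bibliography.
# $1 = input file, $2 = bibliography file, $3 = output file
render_with_refs() {
  pandoc --filter pandoc-citeproc --bibliography "$2" -o "$3" "$1"
}

# Usage (two documents sharing one bibliography):
#   write_refs refs.yaml
#   render_with_refs paper.md refs.yaml paper.html
#   render_with_refs slides.md refs.yaml slides.html
```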

So would this suggest I open both a documentation ticket against pandoc, and a ticket against pandoc-citeproc for the lack of warnings?

Yes, that sounds right. And this can then be closed.

jgm on 13 Apr 2015

I'd still prefer to write metadata in YAML over XML/Dublin Core. Wouldn't this be possible to parse even when LaTeX or something else is the input format?

DivineDominion on 13 Nov 2015

+1 for this. If it's too hard to parse in the input document, an easier option could be to specify --metadata-file=XXX. At the moment the other options only allow you to specify a string value for a specific key; this file would allow us to set complex (nested) values for arbitrary keys.

This is useful when using non-default templates. For example, I am trying to generate docbook output from rst input, and adding <author><affiliation><address><email>..</**> into the existing <articleinfo> is pretty hard.

infinity0 on 5 Dec 2015

Allowing for an additional YAML metadata file would bring one problem: which markup would be allowed for text in the YAML file? I see three possibilities, none of which I really like:

1. Always use Markdown. Inconvenient and unexpected for people used to other formats.
2. Use the same format as the reader. This would make the feature close to useless for epub, docx, etc.
3. Allow the format of the metadata file to be specified separately. Likely to be complex and unintuitive.

tarleb on 20 Aug 2016

What about 2 but do 1 for formats like epub, docx etc?

vyp on 20 Aug 2016

Let's call that 4. Though better than the other three, it would still be inconvenient for users unfamiliar with Markdown.

tarleb on 20 Aug 2016

Why not just do (2)? If it's useless for epub/docx people would simply not use it for that format. It still helps with the other formats, though.

infinity0 on 20 Aug 2016

👍3

The benefit of option (2) is that it's consistent. I experimented a bit, here is a proof of concept for an equivalent but slightly different approach: How about simply supporting the yaml_metadata_block extension for more readers? The linked code implements it for the org reader, but with the restriction that the YAML block is allowed at the top of the document only. The approach builds on existing options and is basically identical to (2) as one can simply cat the two files together.

tarleb on 20 Aug 2016

👍2

Extended PoC, adding YAML support to Org, RST, and LaTeX.

tarleb on 20 Aug 2016

No opposing opinions have been voiced yet, so I opened PR #3084 for this.

tarleb on 22 Aug 2016

What about precedence when the same metadata variable is specified twice? I want a default YAML metadata file for all conversions, but YAML metadata in the (Markdown) source file should take priority. For example, metadata.yaml would define a standard mainfont, which the YAML block in source.md could override when converting to LaTeX.

iandol on 25 Aug 2016

What you want is already doable by passing the YAML file containing the defaults as the last argument:

pandoc -f markdown -s your-input-file.md defaults.yaml

Metadata definitions seen first are kept and left unchanged, even if conflicting data is parsed at a later point.

tarleb on 25 Aug 2016

👍6
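
The precedence rule can be sketched as follows. The file contents in the comments and the helper name are assumptions for illustration. The first-definition-wins behaviour is the one described in the comment above, so it is worth confirming against the pandoc version in use.

```shell
#!/bin/sh
# Render a Markdown document with a defaults file listed last, so that
# metadata in the document is seen first.
# $1 = Markdown input, $2 = YAML defaults file
render_with_defaults() {
  pandoc -f markdown -s "$1" "$2"
}

# Illustrative inputs:
#   doc.md begins with a block that sets      mainfont: Georgia
#   defaults.yaml is a block that sets        mainfont: Palatino
#                                             fontsize: 11pt
# Under the rule described above, mainfont would come from doc.md and
# fontsize from defaults.yaml:
#   render_with_defaults doc.md defaults.yaml > doc.html
```

Both files need their metadata wrapped in YAML block delimiters, because the defaults file is read by the Markdown reader like any other input.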

OK, I had tried that previously and it never worked, but just found that it was a small error in my Markdown file (my last line was a figure block and not terminated with a newline which caused the yaml to become appended as plain text). Adding a couple of newlines and yaml is now correctly parsed with the document metadata correctly taking precedence. Thank you!

iandol on 26 Aug 2016
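
One way to guard against the missing-newline problem described above is to do the concatenation explicitly and force a separator. This is a sketch under assumptions: the function names are made up, and it pipes the joined text to pandoc on stdin instead of passing two file arguments.

```shell
#!/bin/sh
# Print a file, adding a final newline only if the last byte is not one.
# $1 = file to print
cat_with_newline() {
  cat "$1"
  # Command substitution strips a trailing newline, so the result is empty
  # when the file already ends in one (or when the file is empty).
  if [ -n "$(tail -c 1 "$1")" ]; then
    printf '\n'
  fi
}

# Join the document and the defaults with a blank line in between,
# then hand the result to pandoc on stdin.
# $1 = Markdown input, $2 = YAML defaults file
render_safely() {
  { cat_with_newline "$1"; printf '\n'; cat "$2"; } | pandoc -f markdown -s
}

# Usage:
#   render_safely your-input-file.md defaults.yaml > out.html
```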

I came across this issue from pandoc-discuss, and I found a way that kind of works currently. The idea is to convert the source yml and the source document to native first and cat them together (plus a little detail):

pandoc -f markdown -t native -s metadata.yml | sed '$ d' > metadata.native
pandoc -f <fromFormat> -t native -o document.native document.<fromFormat>
pandoc -f native -t <toFormat> -s -o document.<toFormat> metadata.native document.native

The extra detail is the sed call: metadata.yml is treated as a Markdown document with no body, so the last line of the native output is [], which needs to be removed. Another way to remove it is head -n -1 (this does not work with the default head on macOS). In my tests the metadata in native output always sits on a single line; if that holds, head -n1 also works, including on macOS.

Any cli options should be added to the last line only (to avoid having extra metadata somewhere else).

This approach is kind of hacky since metadata.native can only contain meta and document.native cannot contain one. And the syntax in native is not well-known so I'm not sure if there's any other gotcha.

But it seems this is the only currently working method (alternatively one can convert the document to markdown first and cat from there, but the extra conversion can introduce extra loss.)

Edit: Fixed some typos and add some more comments:

This is basically @tarleb's option (1), except that it works today. The 3 lines are long, but a thin wrapper using a shell script or a makefile can hide them away.
Unix only, since I used the shell. But the idea should be applicable to Windows too.
The script above is a sketch, but I tested the idea on real documents to verify that it works.

ickc on 17 Jan 2017
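
The thin wrapper mentioned in the comment above could look roughly like this. It is a sketch, not ickc's script: the function names, argument order, and temporary-directory handling are assumptions, the output format is left for pandoc to infer from the output file name, and the whole approach relies on the native output shape described above (metadata followed by a final [] line).

```shell
#!/bin/sh
# Drop the last line of stdin (the empty body "[]" in the native output).
drop_last_line() {
  sed '$ d'
}

# Merge an external YAML metadata file into a document of any input format.
# $1 = YAML metadata file, $2 = input file, $3 = input format, $4 = output file
convert_with_metadata() {
  cwm_tmp=$(mktemp -d) || return 1
  pandoc -f markdown -t native -s "$1" | drop_last_line > "$cwm_tmp/metadata.native" &&
    pandoc -f "$3" -t native -o "$cwm_tmp/document.native" "$2" &&
    pandoc -f native -s -o "$4" "$cwm_tmp/metadata.native" "$cwm_tmp/document.native"
  cwm_status=$?
  rm -rf "$cwm_tmp"
  return "$cwm_status"
}

# Usage:
#   convert_with_metadata metadata.yml document.rst rst document.html
```

Extra command-line options belong on the last pandoc call only, as noted in the comment.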

Why must the text in a metadata file necessarily be interpreted as any format rather than as plain text?

One alternative would be to make a top-level field like _metadata_format: markdown 'magic'.

bpj on 18 Jan 2017

+++ Benct Philip Jonsson [Jan 18 17 12:11 ]:

Why must the text in a metadata file necessarily be interpreted as any
format rather than as plain text?

Well, if you're writing an abstract for example, it's nice
to be able to include formatting. If you have a title with
math in it, you'd probably like to include math. Plain text
is too limiting.

One alternative would be to make a top-level field like
_metadata_format: markdown 'magic'.

This is prioritizing the less common case over the more
common one. People will find it confusing if this
makes emphasis in the body of the text but not the title.

Better to provide a special way to create a "raw"
metadata field when this is needed. See #2139 for that.

jgm on 19 Jan 2017

I made the 3 commands I suggested above shorter. This requires bash, though (it uses process substitution).

YAML=metadata.yml; INPUT=document.md; OUTPUT=document.pdf
pandoc -f native -s -o $OUTPUT <(pandoc -f markdown -t native -s $YAML | sed '$ d') <(pandoc -t native $INPUT)

I will later add it to Pandoc Tricks · jgm/pandoc Wiki.

ickc on 28 Jan 2017

See https://groups.google.com/d/msg/pandoc-discuss/6KLbZk7NVWk/0XMWewhLCQAJ
for a way to do this using lua filters.

jgm on 17 Nov 2017

Which markup would be allowed for text in the YAML file?

It could be argued that if you want to use a specific format to specify metadata, you should use that format's metadata block syntax inside the document (e.g. .. meta:: for RST). If that doesn't work for you for some reason, you can use an external YAML file, but at that point you have to learn both YAML and Markdown. This would at least keep this mechanism simple and predictable.

If you absolutely must, you can also use generic raw snippets and use whatever syntax you like inside "markdown".

mb21 on 17 Nov 2017

I stand by my last comment: let's introduce a --metadata-file option that takes a YAML file (or JSON file, determined by file suffix) where the strings are interpreted as markdown. (Definitions in the file have lower priority than the ones inside the document, solving #3115.)

We can always add more things later, like:

parsing .. meta:: in RST or <meta> in HTML (which would act analogous to the current YAML metadata blocks in markdown)
adding an additional option that specifies the markup language the metadata is interpreted as (overriding the default which would be set to markdown).

mb21 on 27 Mar 2018

👍2
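
As a sketch of how the proposal above might be used for the rst-to-docbook case raised earlier in the thread: the option spelling is the one proposed in the comment, so it should be checked against pandoc --help on the installed version. The metadata keys, file names, and function names are assumptions for illustration.

```shell
#!/bin/sh
# Write nested metadata. Under the proposal, string values would be read
# as Markdown, so the emphasis in the title would be parsed as such.
# $1 = path of the file to create
write_meta() {
  cat > "$1" <<'EOF'
title: "An *emphasised* title"
author:
- name: Jane Doe
  affiliation: Example University
  email: jane@example.org
EOF
}

# Convert reStructuredText to DocBook with the external metadata applied.
# $1 = metadata file, $2 = input file, $3 = output file
rst_to_docbook() {
  pandoc --metadata-file "$1" -f rst -t docbook -s -o "$3" "$2"
}

# Usage:
#   write_meta meta.yaml
#   rst_to_docbook meta.yaml article.rst article.xml
```

A custom template would still be needed to place the nested author fields in the output.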

I think I like @mb21's suggestion. It's simple, and it would help in some of the practical cases described above.

jgm on 27 Mar 2018

Re: the thread, I have been using gfm+yaml_metadata_block and passing in a .yml file in the inputs. Or, I use --include-in-header=$file.tex.
@jgm re: use-cases, a very common one, which comes up a lot on forums, TeX StackOverflow and so on, is the ability to text-wrap code in fenced code fields, as well as apply other styling information to it, such as line numbers. This is highly desirable in many different kinds of documentation, but there is currently no practical way to do it.

What I would like to see is the equivalent of stuff like --variable urlcolor=$color for more/all LaTeX options (at least the styling ones), or, as mentioned above, the ability to pass through custom LaTeX options more easily than is currently possible.

A problem with the JSON/YML solution is that it is more technical than a lot of users require, so a lot of people would simply give up and move to another solution rather than continue to fiddle with Pandoc arguments and config files.

ssolidus on 29 Mar 2018

@jgm re: use-cases, a very common one, which comes up a lot on forums, TeX StackOverflow and so on, is the ability to text-wrap code in fenced code fields, as well as apply other styling information to it, such as line numbers. This is highly desirable in many different kinds of documentation, but there is currently no practical way to do it.

Sorry, I didn't understand this comment or what it has to do with the topic of this thread.

What I would like to see is the equivalent of stuff like --variable urlcolor=$color for more/all LaTeX options (at least the styling ones), or, as mentioned above, the ability to pass through custom LaTeX options more easily than is currently possible.

This is just a matter of template design. You can always create a custom template that allows you to control some LaTeX option with a variable. And you can also propose modifications to the default template along these lines.

jgm on 29 Mar 2018
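
The custom-template route described above can be sketched like this. The variable name codefontsize, the file names, and the function names are hypothetical; the template still has to be edited by hand to make use of the variable.

```shell
#!/bin/sh
# Dump the default LaTeX template so it can be edited by hand.
# $1 = path for the template copy
export_template() {
  pandoc -D latex > "$1"
}

# Render with the edited template and set one template variable.
# $1 = template, $2 = assignment such as codefontsize=small,
# $3 = input file, $4 = output file
render_with_template() {
  pandoc --template "$1" -V "$2" -o "$4" "$3"
}

# In the edited template, a hypothetical variable could be guarded like this:
#   $if(codefontsize)$
#   % LaTeX that uses $codefontsize$ goes here
#   $endif$
#
# Usage:
#   export_template custom.latex
#   render_with_template custom.latex codefontsize=small doc.md doc.pdf
```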

I had a quick look at implementing this, but unfortunately the YAML parsing is quite intertwined with the rest of the Markdown reader.

This is due to the fact that we share state between the YAML metadata block and the rest of the markdown document (I'm guessing for footnotes etc?). This is not going to happen when the YAML is read in from an external file and merged with the document metadata after the reader has produced a Pandoc Meta [Block], and it wouldn't work for other input formats anyway. Thus we'll just have to make users aware that there's a small difference between pandoc --metadata-file m.yaml input.md and pandoc m.yaml input.md.

Still, we have a choice:

1. Either we refactor the existing YAML parsing and export it as a function from the Markdown reader: yamlToMeta :: PandocMonad m => Yaml.Value -> m Meta (or even one taking a ByteString so we could reuse the decoding with error handling). Then all Strings in the YAML metadata file would share one markdown reader state.
2. Or we reimplement the actual YAML parsing somewhere else and apply readMarkdown to each String individually, in which case state wouldn't be shared (and possibly a few more inconsistencies might pop up).
3. Finally, we could even do (2), but also use the new implementation in the Markdown reader. This seems the cleanest solution (especially if we'd want to parse other syntax than markdown in the future), but possibly might break some existing documents in subtle ways?

I'm unsure what the implications of (not) sharing ParserState are, in practice, with regard to markdown parsing...

mb21 on 31 Mar 2018

👀1

Mauro Bieg notifications@github.com writes:

block and the rest of the markdown document (I'm guessing for
footnotes etc?).

Yes, exactly.

This is not going to happen when the YAML is read in
from an external file and merged with the document metadata after the
reader has produced a Pandoc Meta [Block], and it wouldn't work for
other input formats anyway. Thus we'll just have to make users aware
that there's a small difference between pandoc --metadata-file m.yaml input.md and pandoc m.yaml input.md.

Agreed.

Still, we have a choice:

either we refactor the existing YAML parsing and export it as a
function from the Markdown reader: yamlToMeta :: PandocMonad m => Yaml.Value -> m Meta (or even one taking a ByteString so we could
reuse the decoding with error handling). Then all Strings in the YAML
metadata file would share one markdown reader state.

This seems simplest to me, and I don't see a drawback to sharing
state. This way, for example, you could define footnotes and
reference links within the yaml metadata file. Of course they'd
only work within that file, but still people might expect they
can do this. Is there a downside?

Finally, we could even do (2), but also use the new implementation
in the Markdown reader. This seems the cleanest solution (especially
if we'd want to parse other syntax than markdown in the future), but
possibly might break some existing documents in subtle ways?

If other syntaxes are the issue, then we might try to decouple
the markdown-specific parts of the function from the parts that
deal with YAML. Perhaps the reader could be passed in as a
function? Maybe we could do this in such a way that we
don't hard-code use of ParserState?

jgm on 31 Mar 2018

👍1
