Pandoc: Extension to treat first heading level as title?

Created on 23 Jun 2019 · 30 Comments · Source: jgm/pandoc

I'm using Pandoc 2.7.3 to convert Markdown to AsciiDoc. In the output file, the level of every heading is 1 deeper than expected.

README.md:

# Title

## Level 2

Invocation:

pandoc -o README.adoc README.md

Output (README.adoc):

== Title

=== Level 2

Expected output:

= Title

== Level 2
enhancement

All 30 comments

The = is reserved for the document title, provided in metadata.
This is standard in asciidoc; see e.g. https://raw.githubusercontent.com/asciidoctor/asciidoctor.org/master/docs/asciidoc-syntax-quick-reference.adoc as an example.

Use this form:

---
title: Title
...

# First section

I'm not sure that I got my problem across. I realize that in AsciiDoc, = is the document title. The thing is that Markdown has no comparable syntax for defining the document title. As far as I know, the syntax you mention in your previous comment isn't standard Markdown. So in practice, # is often used in Markdown documents for the document title, then ##, ###, etc. are used for headings -- just like in AsciiDoc.

I have a large number of Markdown files that use this approach and that I'd like to convert to AsciiDoc. So I need to convert # to =, ## to ==, and so on. Is there a way to achieve this?

Well, pandoc uses pandoc flavoured markdown...

I was thinking you could write a lua-filter to decrement all header levels by one, similar to this example: https://pandoc.org/lua-filters.html#modifying-pandocs-manual.txt-for-man-pages, but this causes the current pandoc version to crash with "Prelude.init: empty list" somewhere in the asciidoc writer.

You could change this line and compile pandoc yourself...

Specifically, the hierarchicalizeWithIds function in Pandoc.Shared breaks...

Or write a filter that decrements all heading levels by one, except for the heading 1, which should be set as the document meta data title.
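As a rough illustration of the filter idea above, here is a minimal Python sketch that works on pandoc's JSON AST (the structure produced by `pandoc -t json`). The function name `promote_first_heading` and the trimmed-down sample document are hypothetical, and clamping later level-1 headings at level 1 is an assumption made here, not something settled in this thread.

```python
import copy
import json


def promote_first_heading(doc):
    """Return a copy of a pandoc JSON AST in which the first level-1 heading
    becomes the metadata title and all other headings move up one level.

    Assumption of this sketch: the document is left alone if it already has
    a title or contains no level-1 heading, and further level-1 headings
    stay at level 1 (levels are clamped, never dropped).
    """
    out = copy.deepcopy(doc)
    meta = out.setdefault("meta", {})
    has_h1 = any(b["t"] == "Header" and b["c"][0] == 1 for b in out["blocks"])
    if "title" in meta or not has_h1:
        return out

    blocks = []
    promoted = False
    for block in out["blocks"]:
        if block["t"] == "Header":
            level, attr, inlines = block["c"]
            if level == 1 and not promoted:
                # The heading is consumed: its inlines become the title.
                meta["title"] = {"t": "MetaInlines", "c": inlines}
                promoted = True
                continue
            block = {"t": "Header", "c": [max(1, level - 1), attr, inlines]}
        blocks.append(block)
    out["blocks"] = blocks
    return out


# Hypothetical, trimmed-down AST for "# Title" followed by "## Level 2".
doc = {
    "meta": {},
    "blocks": [
        {"t": "Header", "c": [1, ["title", [], []], [{"t": "Str", "c": "Title"}]]},
        {"t": "Header", "c": [2, ["level-2", [], []],
                              [{"t": "Str", "c": "Level"}, {"t": "Space"},
                               {"t": "Str", "c": "2"}]]},
    ],
}

result = promote_first_heading(doc)
print(json.dumps(result["meta"]))
print([b["c"][0] for b in result["blocks"]])
```

To use something like this as a real filter you would read the JSON from stdin and write the result to stdout; that wiring is left out to keep the sketch short.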

Relevant StackOverflow question and answer: Pandoc: set document title to first title

@tarleb you're a lua wizard! I just grabbed another one of yours for numbered-chapter-reference ~(not enough karma for my votes to show yet)~

Maybe we need a markdown extension that treats a unique level-one header as the metadata title, and promotes all other headers. Something like +top_heading_as_title.

That would solve it for markdown.. but what about reading other formats like HTML?

We could make the extension affect HTML as well. I don't know if there are other formats that use this convention for indicating titles.

This issue is more vexing to me than expected...

It seems to me now that what we need is an extension that treats a unique level-one header as the metadata title, and does nothing to the other headers. Let me explain...

This markdown:

---
title: Title
---

## Level 2

converted with pandoc -s, results in this HTML structure:

<head>
  <title>Title</title>
</head>
<body>
  <header>
    <h1 class="title">Title</h1>
  </header>
  <h2>Level 2</h2>
</body>

which is usually what you want: one h1, one h2.

Now to the asciidoc and html readers:

If you're converting from pandoc markdown (same md as above):

```
% pandoc -t asciidoc -s
---
title: foo
---

## bar
^D
= foo

== bar
```

If you're converting from markdown with a different header convention (OP's md):

```
% pandoc -f markdown+top_heading_as_title -t asciidoc -s
# foo

## bar
^D
= foo

== bar
```


Similarly from HTML; currently we have:

```
% pandoc -f html -t markdown --atx-headers -s
<html>
<head>
  <title>foo</title>
</head>
<body>
  <h1>foo2</h1>
  <h2>bar</h2>
</body>
</html>
^D
---
title: foo
---

# foo2

## bar
```

But what you usually want when converting a website to pandoc markdown is:

```
% pandoc -f html+top_heading_as_title -t markdown --atx-headers -s
<html>
<head>
  <title>foo</title>
</head>
<body>
  <h1>foo2</h1>
  <h2>bar</h2>
</body>
</html>
^D
---
title: foo2
---

## bar
```

Does that make sense?

It makes sense, but that doesn't mean we should parse h2 -> Header 2 in this case.

Converting HTML -> LaTeX, with this style of HTML, you'd generally want the h2's to convert to \section, not \subsection. So they'd need to be Header 1 in the AST. The same is true for many other formats pandoc supports.

One could try to address this at the HTML writer level. When the special extension or option is set, Header 1 renders as h2.

So, to summarize:

  • with -f html+top_heading_as_title, h1 goes to the metadata title and h2 goes to a Header 1.
  • with -t html+top_heading_as_title, metadata title goes to h1 and Header 1 goes to h2.

Ah, considering LaTeX output is indeed interesting.

I was now under the impression that the recommended md for html export is the following (otherwise you get two <h1> with -s, which is usually not recommended):

---
title: foo
---

## bar

But you're saying that for LaTeX export you'd usually want the following?

---
title: foo
---

# bar

> But you're saying that for LaTeX export you'd usually want the following?

Yes, for LaTeX and most other formats. With the ## you'd get subsections numbered 0.1, 0.2, etc. I also produce HTML this way (it's the default for pandoc), but I understand this has drawbacks.
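The "0.1, 0.2" numbering can be reproduced with a small model of hierarchical section counters. This is only a sketch of how such counters typically behave (one counter per level, deeper counters reset when a shallower heading appears); the function name `section_numbers` is made up for the illustration and this is not pandoc or LaTeX code.

```python
def section_numbers(levels):
    """Return dotted section numbers for a sequence of heading levels.

    A counter is kept per level. Levels that were skipped keep the
    value 0, which is where a leading "0." comes from.
    """
    counters = []
    numbers = []
    for level in levels:
        # Create counters (initially 0) for any level not seen yet.
        while len(counters) < level:
            counters.append(0)
        # A heading resets all counters deeper than its own level.
        del counters[level:]
        counters[level - 1] += 1
        numbers.append(".".join(str(n) for n in counters))
    return numbers


# Only level-2 headings, as in the "## bar" example: the level-1 counter stays 0.
print(section_numbers([2, 2]))
# Level-1 headings, as in the "# bar" example.
print(section_numbers([1, 1]))
```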

Relevant old issue: #686.

I've been thinking about this some more and made peace with the fact that, depending on what you're doing, you'll always want to adjust your heading levels. It might depend on the output format you're converting to, or more importantly on where your piece of text fits into an existing hierarchy: say an existing website that uses <h1> for its logo text, or maybe you're writing a book and converting individual md files one by one to HTML pages (each should have an h1), but you concatenate all your md files when going to LaTeX.

So there will always be cases where you should simply make use of the --base-header-level option. Two things:

  • I'm not sure --base-header-level is the most intuitive name. It kind of leaves open what happens to all headings that are not at the 'base level'. What about something like --shift-heading-levels-by with the default value being 0?
  • Wouldn't it be nice if the option could also take negative values? If you passed --shift-heading-levels-by=-1, the first heading with level 1 would be set as the metadata title, and we wouldn't need yet another option/extension. I suppose other level-1 headings could be dropped, as well as any headings that end up having a level <= 0.
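A sketch of the negative-shift proposal, under stated assumptions: headings are modelled as plain `(level, text)` pairs rather than pandoc's AST, the name `shift_heading_levels` is hypothetical, and a positive shift only moves headings here (it does not touch the title). How an already existing title should interact with a promoted heading is not decided above; this sketch simply lets the promoted heading win.

```python
def shift_heading_levels(title, headings, shift):
    """Shift every heading level by `shift`.

    The first heading that lands exactly on level 0 becomes the title;
    any other heading that ends up at level <= 0 is dropped.
    Returns a (title, headings) pair and does not modify its inputs.
    """
    new_title = title
    promoted = False
    shifted = []
    for level, text in headings:
        new_level = level + shift
        if new_level >= 1:
            shifted.append((new_level, text))
        elif new_level == 0 and not promoted:
            new_title = text
            promoted = True
        # Everything else at level <= 0 is silently dropped.
    return new_title, shifted


# The README.md from the original report: "# Title" and "## Level 2".
readme = [(1, "Title"), (2, "Level 2")]
print(shift_heading_levels(None, readme, -1))
print(shift_heading_levels(None, readme, 0))
```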

Interesting idea. But isn't it a bit odd if

  • shift by -1 makes a level-1 heading the metadata title

unless also

  • shift by +1 makes metadata title a level-1 heading ?

This latter would definitely be a change to how --base-header-level=2 currently works.

EDIT: Even so I'm pretty positively disposed to this idea. I believe people have requested negative heading level shifts before. (See #4342)

> shift by +1 makes metadata title a level-1 heading

Yes, I think that would be useful in some rare cases as well.

I suppose --base-header-level should be deprecated then...

🎉

I think it makes sense that shifting between metadata and level-1 heading occurs in both directions. I don't find a compelling counter-example. But if the effect applies to all inputs after concatenation, then can the user not provide a document title that doesn't get demoted to a mere header?

Sometimes users split a book into files. Or a book may constitute a compilation of articles from sources written originally for standalone publication.

For example, consider inputs:

---
title:  Book Title
---

---
title: Beginning
---

In the beginning...

---
title: Ending
---

At the end...

The intention might be to represent:

---
title:  Book Title
---

# Beginning

In the beginning...

# Ending

At the end...

Could the global metadata input be protected from demotion? Could shifting be selected at the granularity of individual inputs, and applied before concatenation?

And would a title be handled differently if coming from the metadata given on the command line versus a YAML source? I wouldn't suggest a solution in which giving a title on the command line is the only way to protect it from demotion.

@brainchild0 see --file-scope.

Would this switch guard any single input file from transformations that would apply to others? I am not seeing any indication of such in the manual.

I don't understand the question. (But I recommend you experiment to find out.)

Following are the best experiments I can do currently:

Looking at my first post in the issue, notice a sequence of three examples of file contents following the line beginning with "For example". I place them in a.md, b.md, and c.md. The example following "The intention" I place in x.md.

Now I try:

~~~
$ pandoc x.md

Beginning

In the beginning…

Ending

At the end…

~~~

The files a.md, b.md, and c.md represent the idea that file x.md is being decomposed into parts. Since the latter two files represent chapters in the form of standalone documents, the level-1 headings are represented as the document titles.

The idea expressed in that post was that it might be useful if I could run a command on these three files that is equivalent to pandoc x.md. This involves shifting b.md and c.md to the right by one level, but "guarding" a.md.

With the current options, I think it is impossible. The closest approximation would be

$ pandoc a.md b.md c.md --shift-heading=1
<h1>Book Title</h1>
<p>In the beginning…</p>
<p>At the end…</p>

Actually, the result currently is that the chapter titles are dropped, because only one title may be recognized for the document.

With --file-scope there is no difference:

$ pandoc a.md b.md c.md  --shift-heading=1  --file-scope
<h1>Book Title</h1>
<p>In the beginning…</p>
<p>At the end…</p>

What would be needed is a way to shift the contents of b.md and c.md to right one level, such that the titles are demoted to level-1 headings, while a.md is "guarded", providing the actual title of the document.

At first it may seem like an unusual case, but I think probably not so. It seems that currently any positive shift value prevents the input stream from giving any data that is used for the document title of the output.

> What would be needed is a way to shift the contents of b.md and c.md to right one level, such that the titles are demoted to level-1 headings, while a.md is "guarded", providing the actual title of the document.

Correct, that can't currently be done.
It would be desirable to make this sort of thing possible using a combination of --file-scope and --metadata-file. But that combination doesn't seem to work the way I'd expect.

% cat m.yaml
title: my real title
% cat a.md
---
title: hi
...

# ok
% pandoc --file-scope --metadata-file m.yaml a.md -t native -s
Pandoc (Meta {unMeta = fromList [("title",MetaInlines [Str "hi"])]})
[Header 1 ("ok",[],[]) [Str "ok"]]
% pandoc --file-scope --metadata-file m.yaml a.md -t native -s --shift-heading=1
Pandoc (Meta {unMeta = fromList []})
[Header 1 ("",[],[]) [Str "hi"]
,Header 2 ("ok",[],[]) [Str "ok"]]

I'd find it more intuitive (and useful) if the heading-shift transformation was done before the metadata was integrated, so the metadata from m.yaml isn't clobbered. However, this should be a new issue.

Actually on reflection, I'm not so sure about this.

> I'd find it more intuitive (and useful) if the heading-shift transformation was done _before_ the metadata was integrated, so the metadata from m.yaml isn't clobbered. However, this should be a new issue.

It's not obvious how to create a new issue that captures the immediate concerns without including the history.

The sequence of processing would seem to be close to the following:

  1. Collect each file not including the one (or ones) containing the global metadata.
  2. For each file collected as such:

    1. Interpret it, including the metadata within it.

    2. Apply any appropriate shift, including, in the case of a positive shift, changing the title from the metadata into a heading.

    3. (Discard other metadata... I assume... or not?)

  3. Concatenate the results.
  4. Apply the global metadata.
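The sequence above can be sketched in a few lines of Python. This is only a model of the proposed order of operations, not of pandoc's behaviour: each input file is a hypothetical `(title, blocks)` pair, blocks are tuples such as `("h", level, text)` and `("p", text)`, and the names `shift_file` and `build_book` are invented for the illustration. Per-file metadata other than the title is simply discarded, which is one possible answer to the open question in step 2.3.

```python
def shift_file(title, blocks, shift=1):
    """Step 2: demote one file's headings and turn its title into a heading."""
    if shift < 1:
        raise ValueError("this sketch only models positive shifts")
    out = []
    if title is not None:
        # The title sits "above" level 1, so it lands on level `shift`.
        out.append(("h", shift, title))
    for block in blocks:
        if block[0] == "h":
            out.append(("h", block[1] + shift, block[2]))
        else:
            out.append(block)
    return out


def build_book(global_meta, files, shift=1):
    """Steps 1 to 4: shift each file on its own, concatenate, then apply
    the global metadata, which is never touched by the shift."""
    body = []
    for title, blocks in files:
        body.extend(shift_file(title, blocks, shift))
    return dict(global_meta), body


# a.md supplies the global metadata; b.md and c.md are the chapters.
meta, body = build_book(
    {"title": "Book Title"},
    [
        ("Beginning", [("p", "In the beginning...")]),
        ("Ending", [("p", "At the end...")]),
    ],
)
print(meta)
for block in body:
    print(block)
```

With these inputs the result has the same shape as x.md: the book title stays in the metadata and each chapter title becomes a level-1 heading.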

Which parts of this discussion, if any, would you want moved to a new issue, and which would you be less open to seriously considering at the current moment?

Also, I'm not sure about a compelling use case, but if a left shift squashes several header levels into one, then the original level of each affected header can in principle be preserved in ancillary data, like XHTML data- attributes or classes (e.g. <h1 data-original-level="-1">).

See #5957 for an unintended consequence of this change.

I'm going to roll back:

> shift by +1 makes metadata title a level-1 heading

This breaks some workflows that used to be supported with --base-header-level (see #5957).

Also: suppose you want to render a document with both latex and html. It would be natural to use level-1 headings. But in the HTML version you might want level-2 headings, since the title will be rendered as level-1. So you'd want to shift heading levels, without depopulating the title in metadata.
