Pandoc: Implement "First Paragraph" style in docx writer?

Created on 10 Feb 2015 · 23 Comments · Source: jgm/pandoc

The ODT writer has a nice feature: it has a separate "First Paragraph" style based on "Text Body", which it uses for the first paragraph after headers, block quotes, etc. This is nice because it allows indenting paragraphs but leaving paragraphs after blockquotes and figures unindented. (Of course, by default, "First Paragraph" and "Text Body" are the same, but they _can_ be styled differently.)

This is important enough to me that for a large manuscript, I had to convert to ODT, download LO.o, and then save as docx, and then go from there. I'd prefer to go straight to Docx, since then I could use my reference stylesheet.

Would there be any objections to me implementing this for the Docx writer?

Pardon me if I haven't understood your issue properly.
Could you use a reference docx as suggested here: https://github.com/jgm/pandoc-templates/issues/20

`--reference-docx=`*FILE*
:   Use the specified file as a style reference in producing a docx file.
    For best results, the reference docx should be a modified version
    of a docx file produced using pandoc.  The contents of the reference docx
    are ignored, but its stylesheets and document properties (including
    margins, page size, header, and footer) are used in the new docx. If no
    reference docx is specified on the command line, pandoc will look
    for a file `reference.docx` in the user data directory (see
    `--data-dir`). If this is not found either, sensible defaults will be
    used. The following styles are used by pandoc: [paragraph]
    Normal, Compact, Title, Subtitle, Authors, Date, Abstract, Heading 1,
    Heading 2, Heading 3, Heading 4, Heading 5, Block Quote, Definition Term,
    Definition, Bibliography, Body Text, Table Caption, Image Caption;
    [character] Default Paragraph Font, Body Text Char, Verbatim Char,
    Footnote Ref, Link.

You could then customize the styles in the resulting reference docx.

Yes, but the particular style I want (one that occurs in paragraphs following block quotes, headers, figures, etc) isn't in the output, so I can't customize it.

Could you please upload a sample source and the resulting docx, if you have time? I'm pretty new to Pandoc, so I'm not sure what styles are included in this specific kind of document.

After experimenting and peeking at the code in OpenDocument.hs,

    setFirstPara :: State WriterState ()
    setFirstPara = modify $ \s -> s { stFirstPara = True }

I understand your request (as far as I can see, docx writer is not looking for first paragraphs).

Modifying styles in the docx reference file will not help, since it does not change the style already applied to existing text ('Normal' for paragraphs); that is, 'Style for following paragraph' only affects text typed in afterwards.

Yep, that's the issue. I've already written the code to add it to the docx writer. Just wanted to check whether @jgm had any objections to me putting it in.

I'm looking forward to your docx writer patch (being incorporated)!

No objections!

I assume the "First Paragraph" style will depend on/inherit the "Paragraph" style so that changes to the latter are automatically reflected in the former, right?

Yep -- or, rather it derives from "Normal", which seems to be the base paragraph style in reference.docx.

Wouldn't it be better to use the same style names used in Open Document writer? (And of course define these accordingly in Word).

@nkalvi -- not sure why, necessarily. I guess it could be done in the future, but seems a different issue.

@jgm -- I've tested it and it seems to work (as far as I can be sure, given the lack of docx writer tests). Should I push it, or would you like to take a look through a PR?

Of course not necessary. I just prefer the name 'Text Body' to 'Normal'. Looking forward to the update!

As for testing, can one use the native files under https://github.com/jgm/pandoc/tree/master/tests/docx?

Go ahead and push it.

cool -- pushed.

@jkr Sorry to drag this up again, but do you know a way to make the style following a block quote not "First Paragraph"? I want indented paragraphs after headers but unindented paragraphs after block quotes, and I'm having trouble separating them as styles.

@melaniewalsh : the simplest way is to explicitly put a div with that custom-style around it:

~~~markdown
# Section

Some text.

> Block quote

::: {custom-style="BodyText"}
Some more text.
:::

And this.
~~~

But this is a bit of a hassle, and makes it harder to move paragraphs around. So you could write a filter that would do something similar. Here's one way to do it (though perhaps not the best):

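~~~lua
-- True while the previous top-level block was a block quote.
local after_bq = false

local function convert (blk)
  if blk.t == "Para" and after_bq then
    after_bq = false
    -- Wrap the paragraph in a div carrying the custom style.
    return pandoc.Div({blk}, pandoc.Attr("", {}, {["custom-style"]="BodyText"}))
  else
    -- Remember whether this block is a block quote for the next iteration.
    after_bq = (blk.t == "BlockQuote")
    return blk
  end
end

function Pandoc(doc)
  -- Only the top-level blocks are visited, in document order.
  local out = doc.blocks:map(convert)
  return pandoc.Pandoc(out, doc.meta)
end
~~~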

Save this in a file (`after_bq.lua`, or whatever) and then run pandoc with `pandoc input.md --lua-filter after_bq.lua -o output.docx`.

The script just iterates over all the toplevel blocks, and if one comes after a block quote, it surrounds it with that custom-style div. If you want to have it work on sub-blocks (in notes and blockquotes) it might be a bit more complicated. For more on lua filters see here: https://pandoc.org/lua-filters.html
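
For the sub-block case mentioned above, one possible approach is a `Blocks` filter function, which pandoc calls on every list of blocks (document body, block quote contents, list items, notes) rather than only the top-level one. What follows is a minimal sketch, not a tested replacement for the filter above: it assumes a pandoc version that supports `Blocks` filter functions, and the helper name `wrap_after_blockquote` and the `STYLE` constant are made up for illustration.

~~~lua
-- Style name taken from the filter above; change it to match your reference docx.
local STYLE = "BodyText"

-- Return a new list in which every Para that directly follows a
-- BlockQuote (within the same block list) is wrapped in a styled Div.
local function wrap_after_blockquote(blocks)
  local out = {}
  local after_bq = false
  for _, blk in ipairs(blocks) do
    if after_bq and blk.t == "Para" then
      out[#out + 1] = pandoc.Div({blk},
        pandoc.Attr("", {}, {["custom-style"] = STYLE}))
    else
      out[#out + 1] = blk
    end
    -- The state is local to this list, so nesting cannot leak it.
    after_bq = (blk.t == "BlockQuote")
  end
  return out
end

-- Called by pandoc on each list of blocks in the document.
function Blocks(blocks)
  return wrap_after_blockquote(blocks)
end
~~~

Because the flag lives inside the function instead of in a file-level variable, a block quote at the end of one list cannot affect the first paragraph of another. It is invoked the same way as the filter above, via `--lua-filter`.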

@jkr Yes!! Sweet relief. This lua filter worked perfectly. I didn't even know about lua filters before.

Thanks so much for taking the time to respond to my question. I had been trying to figure this out for a while, and I really appreciate it.

Glad it helped. For the sake of general correctness, or if anyone else comes across this discussion, a better way to implement that same filter:

~~~lua
local after_bq = false

return {
  {
    BlockQuote = function (blk)
      after_bq = true
    end,

    Para = function (blk)
      if after_bq then
        after_bq = false
        return pandoc.Div({blk},
          pandoc.Attr("", {}, {["custom-style"]="BodyText"}))
      end
    end,

    Block = function (blk)
      after_bq = false
    end
  }
}
~~~

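The `Block` function is a fallback: it runs only for block types that have no more specific function, which here means anything other than `BlockQuote` and `Para`.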

Sorry if I'm misunderstanding, but in reading about this issue, I got the impression that the docx writer had been improved to use "First Paragraph" rather than "Normal" style for paragraphs that immediately follow headings (or other, non-body text). I'm still finding use of the Normal style, however.

I've now discovered that First Paragraph style is, in fact, used if it is already defined in the reference docx. If it is not defined, Normal style is used. I guess I misunderstood, thinking that Pandoc would create such a style in the target docx if it was not already defined. Am I understanding this right?

If you're not using any reference-doc, pandoc will create the style for you. If you are already using a custom reference-doc, the styles are retrieved from it, so you should have the "First Paragraph" style defined there. Otherwise I guess Normal is the default for undefined styles.
I think this is something better asked on pandoc's discussion list.

Thanks. I'll check out the discussion list, but I'm raising this here because of a potential bug in either the software or documentation, which says the following:

"For docx output, styles will be defined in the output file as inheriting from normal text, if the styles are not yet in your reference.docx. If they are already defined, pandoc will not alter the definition."

This is a bit ambiguous about the situation I describe, but to me, it implies that Pandoc will create a First Paragraph style in the target docx, inheriting from Normal, if it is not already found in the reference docx.

I have also noticed that, when converting in the reverse direction from docx to md, syntax for First Paragraph style is placed in the md, even when it was Normal style after a heading in the source docx. So, I think it would be more consistent behavior if First Paragraph and other styles were created in an md-to-docx conversion if they are not found in the reference docx. Of course, if they are found, those predefined styles should be used instead.

If the software is not changed, then I suggest making the documentation more clear about the behavior.

I may raise other inconsistencies that I have noticed regarding docx conversions, but I will do so in separate threads. With that said, let me express my gratitude to the folks working on these capabilities, which I use daily in my work!

I am happy to avoid the "First Paragraph" style in most cases. In ODT, the fallback is the root-level style. In most of my use cases, a fallback to "Body Text" would be preferable, since that is the style of all other paragraphs.

Any thoughts?

(With the risk of poking the wrong thread, given the talk about discussion lists, etc.)
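
Until the writer itself offers such a fallback, the custom-style div trick from earlier in this thread could probably be generalized into a workaround. The following is only a sketch under stated assumptions: the `TRIGGERS` set is a guess at the block types after which "First Paragraph" shows up (it is not taken from the docx writer's source), `STYLE` uses the "Body Text" name from the manual's style list, and both names are invented for this example.

~~~lua
-- Style to apply in place of "First Paragraph".
local STYLE = "Body Text"

-- ASSUMPTION: block types after which the writer would otherwise use
-- "First Paragraph". Edit this set to match what you see in your output.
local TRIGGERS = { Header = true, BlockQuote = true }

-- Called by pandoc on each list of blocks in the document.
function Blocks(blocks)
  local out = {}
  local prev_type = nil
  for _, blk in ipairs(blocks) do
    if blk.t == "Para" and prev_type ~= nil and TRIGGERS[prev_type] then
      -- An explicit custom style on the paragraph, as in the div example above.
      out[#out + 1] = pandoc.Div({blk},
        pandoc.Attr("", {}, {["custom-style"] = STYLE}))
    else
      out[#out + 1] = blk
    end
    prev_type = blk.t
  end
  return out
end
~~~

A paragraph at the very start of a block list is left alone here, since nothing precedes it; whether that case needs handling depends on how your documents begin.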
