Pandoc: Horizontal Rule in DOCX

Created on 4 Dec 2015 · 20 Comments · Source: jgm/pandoc

Consider the following text:

# Chapter
## Section

Section text.

---

Divided section.

## Second Section
# Second Chapter

This generates a document that, when opened in LibreOffice 5.x, resembles:

[image: example-output]

The Haskell code that generates the XML looks like:

blockToOpenXML _ HorizontalRule = do
  setFirstPara
  return [
    mknode "w:p" [] $ mknode "w:r" [] $ mknode "w:pict" []
    $ mknode "v:rect" [("style","width:0;height:1.5pt"),
                       ("o:hralign","center"),
                       ("o:hrstd","t"),("o:hr","t")] () ]

The width:0 does not appear to be correct.

Note that typical manuscript format (for submitting to publishers) prefers a centered # character to a horizontal rule. If there were a way to configure how a horizontal rule should be presented (e.g., centered, text, or line width and thickness) from the command line (when generating DOCX or RTF files), that would be quite useful.

For example:

pandoc --reference-docx=reference.docx -s --hr-width 50% --hr-height .5pt
pandoc --reference-docx=reference.docx -s --hr-text "* * *" --hr-justify right
pandoc --reference-docx=reference.docx -s --hr-text "#" --hr-justify center

The hr prefix in these option names could equally be horizontal-rule or rule.

The XML to center a hashmark, for example:

<w:p>
  <w:pPr>
    <w:pStyle w:val="Normal"/>
    <w:jc w:val="center"/>
    <w:rPr></w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr></w:rPr>
    <w:t>#</w:t>
  </w:r>
</w:p>

It's possible to write a shell script to work around the issue, but it's a fairly cumbersome hack:

#!/bin/bash

# Filename to update
MANUSCRIPT=filename.docx

chmod 644 "$MANUSCRIPT"

# Use the reference document (template) to create the DOCX version.
pandoc \
  --reference-docx=reference.docx \
  -o "$MANUSCRIPT" \
  -s *.md

WORKING_DIR=/tmp/replacement
DOC_DIR="$WORKING_DIR/word"

rm -rf "$WORKING_DIR"
mkdir -p "$WORKING_DIR"

# Modify the DOCX file directly.
unzip "$MANUSCRIPT" -d "$WORKING_DIR"

# Replace the horizontal rule with a centered hashmark.
sed -e 's!<w:p><w:r><w:pict><v:rect style="width:0;height:1.5pt" o:hralign="center" o:hrstd="t" o:hr="t" /></w:pict></w:r></w:p>!<w:p><w:pPr><w:pStyle w:val="Normal"/><w:jc w:val="center"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t>\#</w:t></w:r></w:p>!g' "$DOC_DIR/document.xml" > "$DOC_DIR/document.xml.tmp" && mv "$DOC_DIR/document.xml.tmp" "$DOC_DIR/document.xml"

# Repack the document. Remove any stale archive first, because zip
# would otherwise update the existing file instead of recreating it.
rm -f "/tmp/$MANUSCRIPT"
pushd "$WORKING_DIR"
zip -9 -r "/tmp/$MANUSCRIPT" .
popd
rm -rf "$WORKING_DIR"

# Replace the existing manuscript with the revised version.
mv "/tmp/$MANUSCRIPT" .
Docx writer

All 20 comments

I can confirm (this excellent) bug report.

Knowing zip about Word's xml, I messed around a tiny bit with the document.xml file and found that deleting that last style setting, ,("o:hr","t") in the code, and giving the width an actual value like 50pt creates something, but it's not centered. Creating a rectangle directly leaves very different xml.

In Word, the output looks fine -- you get a horizontal line spanning the entire width.

So, is this perhaps a LibreOffice issue?

> you get a horizontal line spanning the entire width.

What if a centered hash (or three) is desired instead of a horizontal rule (as per manuscript submission guidelines, which vary from publisher to publisher)?

If one of the main uses of Markdown is to separate content from presentation, then having no way to change the presentation of --- into something other than a horizontal rule (in Microsoft/Libre Office documents) seems limiting.

Would it be helpful to split this bug report (doesn't work in LibreOffice) from the feature request (provide more output options for horizontal rules)?

I can confirm -w docx creates a rule visible in Word but not LibreOffice.
Conversely, -w odt creates a rule visible in LibreOffice but not Word!

It would be nice to have something that was compatible in both but I don't know if that is possible.

Well, that's annoying.

As I said above, I know zip about this, but I'm happy to post something over at LibreOffice's tracker, if no one with more knowledge wants to get involved. :-)

If we replaced pandoc's horizontal rules in docx and odt with something like a centered paragraph containing #, it would be more portable, but this would also make the output less like that in other formats.

+++ John Muccigrosso [May 06 16 08:07 ]:

> Well, that's annoying.
>
> As I said above, I know zip about this, but I'm happy to post something over at LibreOffice's tracker, if no one with more knowledge wants to get involved. :-)


I'm not familiar with the markup of odt or docx, but I imagine a single-cell table of 100% width with a bottom border would be cross-compatible in both apps.

> If we replaced pandoc's horizontal rules in docx and odt with something like a centered paragraph containing #, it would be more portable, but this would also make the output less like that in other formats.

I didn't mean to imply that the horizontal rules should be replaced altogether. I meant to imply it should be an option. Like a factory method that produces the function to churn out the desired transformation depending on a user setting.

For example:

parameters = "# # #"
separator = RuleFactory.create( "horizontal-sep", parameters )
separator.print
separator = RuleFactory.create( "text-sep", parameters )
separator.print
separator = RuleFactory.create( "text-sep", "#", centered = false )
separator.print
separator = RuleFactory.create( "xml-sep", "<custom xml>" )
separator.print

The cases:

  1. Parameters are ignored and the separator is a horizontal rule (the default). How it is implemented (rule vs. table) is a detail.
  2. Parameters inject a text separator (of three hashes), centered by default.
  3. Text (single hash) is used but not centered.
  4. Separator is created using a custom XML fragment injected directly into the document.

In other words, instead of hard-coding a few different solutions, a general purpose solution that has some intelligent defaults (like a regular horizontal rule) that allows any user-supplied text to be injected would be rather useful (for cases we cannot currently conceive). This would, for example, allow a user to use any type of totally sweet horizontal rule, limited by imagination, not code.
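The four cases above could be sketched as a small dispatcher. This is only an illustration in shell, not pandoc code: the function name make_separator and its arguments are hypothetical, the bottom-border paragraph is just one possible way to draw the default rule, and the text is assumed to be XML-safe already (no escaping is attempted).

```shell
# make_separator KIND [TEXT] [ALIGN]   (hypothetical helper, for illustration only)
# Prints an OpenXML paragraph for the requested separator kind.
make_separator() {
  kind=$1
  text=${2:-#}
  align=${3:-center}
  case $kind in
    horizontal-sep)
      # Default: an empty paragraph with a bottom border; TEXT is ignored.
      printf '<w:p><w:pPr><w:pBdr><w:bottom w:val="single" w:sz="6" w:space="1" w:color="auto"/></w:pBdr></w:pPr></w:p>\n'
      ;;
    text-sep)
      # A paragraph containing TEXT, justified according to ALIGN.
      printf '<w:p><w:pPr><w:jc w:val="%s"/></w:pPr><w:r><w:t>%s</w:t></w:r></w:p>\n' "$align" "$text"
      ;;
    xml-sep)
      # Pass a user-supplied fragment through unchanged.
      printf '%s\n' "$text"
      ;;
    *)
      echo "unknown separator kind: $kind" >&2
      return 1
      ;;
  esac
}

make_separator horizontal-sep '# # #'   # case 1: parameters ignored
make_separator text-sep '# # #'         # case 2: centered by default
make_separator text-sep '#' left        # case 3: single hash, not centered
make_separator xml-sep '<custom xml>'   # case 4: raw fragment
```

Keeping the default in its own branch means the existing behaviour stays untouched unless a user explicitly asks for something else.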

I did a little internet searching on this topic and found that there are two recommended methods in Word for inserting a horizontal rule (hr). Both work in the 2011 version I have on my machine:

  1. Type three hyphens and return and Word will replace that with an hr. (ref)
  2. Use the border tool to insert "horizontal." (ref)

Only the first method (hyphens) shows as a rule in LibreOffice, which suggests that there's a bug in the border method, and that perhaps pandoc should use method 1 rather than method 2. Here's the relevant document XML from my very short document (formatted for better readability).

Method 1 produces 2 paragraphs in my doc:

<w:p w14:paraId="37846EB0" w14:textId="77777777" w:rsidR="001F3A44" w:rsidRDefault="001F3A44">
    <w:pPr>
        <w:pBdr>
            <w:bottom w:val="single" w:sz="6" w:space="1" w:color="auto" />
        </w:pBdr>
        <w:rPr>
            <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" />
        </w:rPr>
    </w:pPr>
</w:p>
<w:p w14:paraId="6E972163" w14:textId="77777777" w:rsidR="001F3A44" w:rsidRDefault="001F3A44">
    <w:pPr>
        <w:rPr>
            <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" />
        </w:rPr>
    </w:pPr>
</w:p>

The border method (which does _not_ work in LibreOffice) produces this:

<w:p w14:paraId="1C789044" w14:textId="77777777" w:rsidR="001F3A44" w:rsidRDefault="001F3A44" w:rsidP="001F3A44">
    <w:pPr>
        <w:pBdr>
            <w:between w:val="single" w:sz="4" w:space="1" w:color="auto" />
        </w:pBdr>
    </w:pPr>
</w:p>
<w:p w14:paraId="58433B4C" w14:textId="77777777" w:rsidR="001F3A44" w:rsidRDefault="001F3A44" w:rsidP="001F3A44">
    <w:pPr>
        <w:pBdr>
            <w:between w:val="single" w:sz="4" w:space="1" w:color="auto" />
        </w:pBdr>
    </w:pPr>
</w:p>

For some reason, although both methods insert a paragraph border, only the first works in LibreOffice.

I didn't try to make a minimal case, but method 1 certainly seems better than the current pandoc approach of inserting a rectangle (v:rect). Given that it's a paragraph border, I think the only way to affect the width is to change the paragraph size. If I give it a 3cm indent on each side (can Word handle percentage values in its xml?), I get the following for the first paragraph:

<w:p w14:paraId="37846EB0" w14:textId="77777777" w:rsidR="001F3A44" w:rsidRDefault="001F3A44" w:rsidP="00215DC2">
    <w:pPr>
        <w:pBdr>
            <w:bottom w:val="single" w:sz="6" w:space="1" w:color="auto" />
        </w:pBdr>
        <w:ind w:left="1701" w:right="1701" />
        <w:rPr>
            <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" />
        </w:rPr>
    </w:pPr>
    <w:bookmarkStart w:id="0" w:name="_GoBack" />
    <w:bookmarkEnd w:id="0" />
</w:p>
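For reference, the w:ind values are measured in twentieths of a point (1440 per inch), which is where 1701 comes from for a 3 cm indent. A quick check in shell, as a sketch only; the function name cm_to_twips is made up for this example:

```shell
# Convert centimetres to twentieths of a point (the unit used by w:ind).
# 1 inch = 2.54 cm = 1440 twentieths of a point; adding 0.5 rounds to nearest.
cm_to_twips() {
  awk -v cm="$1" 'BEGIN { printf "%d\n", cm / 2.54 * 1440 + 0.5 }'
}

cm_to_twips 3   # prints 1701
```

Because the indent is an absolute length, a rule that should span a fixed fraction of the page has to be computed from the page width and margins of the particular document.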

Hopefully somebody can tweak the writer to adopt this method.

"Note that typical manuscript format (for submitting to publishers) prefers a centered # character to a horizontal rule." @DaveJarvis

Adding my voice in support of making the HorizontalRule in docx / odt configurable by users. I'm working on converting markdown into the de facto standard for submitting short stories and novels. Both use a centered # character as mentioned in the original issue.

I've managed to recreate most of the format using just a modified reference doc, but having the HorizontalRule hard-coded makes complete automation impossible. I'm not a Haskell developer, but let me know how I can help.

If this could be handled with styles, we could customize in the reference.docx, but I don't believe it can be.

One possible approach would be to use a lua filter to convert HorizontalRule elements into the appropriate raw openxml code. This would be easy, I just need to know what that code is. From a quick experiment, something like this may work:

<w:p>
  <w:pPr>
    <w:ind w:left="576" w:right="576"/>
    <w:jc w:val="center"/>
  </w:pPr>
  <w:r>
    <w:t>#</w:t>
  </w:r>
</w:p>

+++ David L. Day [Feb 04 18 11:47 ]:

> "Note that typical manuscript format (for submitting to publishers) prefers a centered # character to a horizontal rule." [1]@DaveJarvis
>
> Adding my voice in support of making the HorizontalRule in docx / odt configurable by users. I'm working on converting markdown into the de facto standard for submitting [2]short stories and [3]novels. Both use a centered # character as mentioned in the original issue.
>
> I've managed to recreate most of the format using just a modified reference doc, but having the HorizontalRule hard-coded makes complete automation impossible. I'm not a Haskell developer, but let me know how I can help.

References

  1. https://github.com/davejarvis
  2. https://www.shunn.net/format/story.html
  3. https://www.shunn.net/format/novel.html

Lua filter would be this simple (untested).

local hashrule = [[<w:p>
  <w:pPr>
    <w:ind w:left="576" w:right="576"/>
    <w:jc w:val="center"/>
  </w:pPr>
  <w:r>
    <w:t>#</w:t>
  </w:r>
</w:p>]]

function HorizontalRule(el)
    return pandoc.RawBlock('openxml', hashrule)
end

I took the output I was generating and replaced the horizontal line with a single #, centered, with no indent. After unzipping, here's what I found:

    <w:p w:rsidR="00F8392C" w:rsidRDefault="00F8392C" w:rsidP="00F8392C">
      <w:pPr>
        <w:pStyle w:val="BodyText"/>
        <w:ind w:firstLine="0"/>
        <w:jc w:val="center"/>
      </w:pPr>
      <w:r>
        <w:t>#</w:t>
      </w:r>
    </w:p>

Looks roughly the same, but uses the BodyText styling from the reference doc.

I had looked at the filters documentation, but it didn't seem like I could do this sort of transformation. Admittedly, I have not looked at the lua filters documentation, so I'll get into that now. Writing custom filters is perfectly acceptable for what I'm trying to do. Thank you!

The lua filter worked perfectly! Again, thank you. I must have skipped the lua docs b/c Ubuntu still has pandoc 1.9x. I used LinuxBrew to get 2.1.x installed, and should be able to do everything else I need now.

Not wanting to re-open, but wanting to thank @jgm for the centered Hash solution for Docx format.

@davidlday @jgm Could you explain what I need to do to achieve this? I don't know how to use lua filters. I'm also trying to use pandoc to convert markdown into manuscript odt. I'm working with odt format, btw, not docx.

https://gist.github.com/Merovex/05e3216f8f4f6e965cd9d564b1496719

I created a gist of the entire lua file.

> @davidlday @jgm Could you explain what I need to do to achieve this? I don't know how to use lua filters. I'm also trying to use pandoc to convert markdown into manuscript odt. I'm working with odt format, btw, not docx.

See the gist by @Merovex. If you need an example of how to implement, take a look at my pandoc-templates.
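For anyone new to lua filters, the basic usage can be sketched as follows. This assumes pandoc 2.0 or later (which provides --lua-filter); the file names hashrule.lua, sample.md and sample.docx are arbitrary choices for the example, and the filter body is the one posted earlier in this thread, not the contents of the gist. Note that the raw block is tagged 'openxml', so it only affects docx output; odt output would need a raw block in that format's own markup, which is not shown here.

```shell
#!/bin/sh
# Save the filter from this thread to a file (the name is arbitrary).
cat > hashrule.lua <<'EOF'
local hashrule = [[<w:p>
  <w:pPr>
    <w:ind w:left="576" w:right="576"/>
    <w:jc w:val="center"/>
  </w:pPr>
  <w:r>
    <w:t>#</w:t>
  </w:r>
</w:p>]]

function HorizontalRule(el)
    return pandoc.RawBlock('openxml', hashrule)
end
EOF

# A tiny input document containing one horizontal rule.
printf 'Section text.\n\n---\n\nDivided section.\n' > sample.md

# Run pandoc with the filter, but only if pandoc is installed.
if command -v pandoc >/dev/null 2>&1; then
  pandoc --lua-filter=hashrule.lua -o sample.docx sample.md \
    || echo "pandoc run failed (is it version 2.0 or later?)" >&2
fi
```

The filter is applied to pandoc's internal document before the docx writer runs, so no unzipping or sed post-processing is needed.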

@davidlday @Merovex Thanks!
