Pandoc: Empty paragraphs disappear in docx reader

Created on 30 Jun 2015 · 28 Comments · Source: jgm/pandoc

To reproduce, create a minimal docx file:

$ cat <<EOF | pandoc -f native -o test.docx
[Para [Str "foo"]
,Para []
,Para [Str "bar"]]
EOF

Verify that it contains 3 paragraphs:

$ unzip test.docx
...
$ xpath word/document.xml '//w:p'
Found 3 nodes:
-- NODE --
<w:p><w:pPr><w:pStyle w:val="FirstParagraph" /></w:pPr><w:r><w:t xml:space="preserve">foo</w:t></w:r></w:p>
-- NODE --
<w:p><w:pPr><w:pStyle w:val="BodyText" /></w:pPr><w:r><w:t xml:space="preserve" /></w:r></w:p>
-- NODE --
<w:p><w:pPr><w:pStyle w:val="BodyText" /></w:pPr><w:r><w:t xml:space="preserve">bar</w:t></w:r></w:p>

Try converting back to Pandoc’s native format:

$ pandoc -i test.docx -t native
[Para [Str "foo"]
,Para [Str "bar"]]

Docx reader more-discussion-needed

hftf

Most helpful comment

Further reflections: I think the change I made may have
been a bit hasty. It does seem to me now that "preserve"
would be a better default, not least because it doesn't
change previous behavior in a way that might cause problems.

So, here's a proposal:

1. Remove the --strip-empty-paragraphs option (or deprecate
   it and make it do nothing).
2. Add a new extension, empty_paragraphs, which is disabled
   by default and allows empty paragraphs.
3. Make the docx, html, docbook, and odt readers sensitive to this.
4. Make the docx, html, docbook, and odt writers sensitive to this.

I think this would be better than what we have now.
Thoughts?

jgm on 4 Dec 2017

👍2

All 28 comments

The same bug might have been discussed in this mailing list thread that I just found. Note that this issue is not about paragraph styles (margins or paddings before or after a paragraph) or making extra space by using several line breaks.

hftf on 1 Jul 2015

I’ve looked at the source code to find the bug. (Sorry I haven’t posted a pull request; I’m new to Haskell.)

I propose the following changes:

Reader/Docx.hs, lines 484–486

```diff
      if dropCap pPr
        then do modify $ \s -> s { docxDropCap = ils' }
                return mempty
        else do modify $ \s -> s { docxDropCap = mempty }
-               return $ case isNull ils' of
-                 True -> mempty
-                 _ -> parStyleToTransform pPr $ para ils'
+               return $ parStyleToTransform pPr $ para ils'
```

I think this fixes the reported bug.

This change causes 13 Docx tests to fail. This is because the original files have empty paragraphs, and Pandoc currently strips them out. I think either the expected output of these tests should be carefully updated to include the new ,Para [] lines, or the empty paragraphs should be removed from the original files.

For example, this is a screenshot of the tests/docx/drop_cap.docx test. Note the empty paragraphs:

Currently, the expected output of this test does not include the empty paragraphs:

[Para [Str "Drop",Space,Str "cap."] ,Para [Str "Next",Space,Str "paragraph."] ,Para [Str "Drop",Space,Str "cap",Space,Str "in",Space,Str "margin."] ,Para [Str "Drop",Space,Str "cap",Space,Str "(not",Space,Str "really)."]]

But if the expected output were to be updated:

[Para [Str "Drop",Space,Str "cap."] ,Para [] ,Para [Str "Next",Space,Str "paragraph."] ,Para [] ,Para [Str "Drop",Space,Str "cap",Space,Str "in",Space,Str "margin."] ,Para [] ,Para [] ,Para [] ,Para [Str "Drop",Space,Str "cap",Space,Str "(not",Space,Str "really)."]]

Reader/Docx.hs, line 477

   | otherwise = do
     ils <- concatReduce <$> mapM parPartToInlines parparts >>=
-           (return . fromList . trimLineBreaks . normalizeSpaces . toList)
+           (return . fromList . normalizeSpaces . toList)

I don’t think this breaks any current tests, so new tests should probably be added.

I think allowing line breaks at the beginning or end of a paragraph is a feature; it’s even supported in Markdown.

Writer/Docx.hs, line 769

 -- fixDisplayMath sometimes produces a Para [] as artifact
-blockToOpenXML _ (Para []) = return []
 blockToOpenXML opts (Para lst) = do
   isFirstPara <- gets stFirstPara

I don’t think this breaks any current tests, so new tests should probably be added. If this change interacts badly with fixDisplayMath, then we might need to update that function.

In my opinion, empty paragraphs are a feature of Docx, (and HTML¹, etc.) files. In fact, empty paragraphs are already supported in the AST. Even if there isn’t a Markdown analog, I think that any stripping of empty paragraphs should happen in the AST → Markdown Writer step, and not the Docx Reader → AST step.

I did not yet have a chance to think of potential objections to these changes.

¹ Even though by default, empty paragraph margins are completely collapsed. But this can be easily fixed by transforming Para [] to <p>&nbsp;</p> or <p><br /></p>, or adding a CSS rule for p:empty.

I generated a sample Docx file after making the changes:

$ cat <<EOF > test.html
<p>Empty para [</p><p></p><p>]</p>
<p>Para with space then newline [</p><p> 
</p><p>]</p>
<p>Para with br [</p><p><br /></p><p>]</p>
<p>Para with empty em [</p><p><em></em></p><p>]</p>
EOF
$ pandoc-new test.html -o test-new.docx
$ diff -U5 <(pandoc-old -i test-new.docx -t native) <(pandoc-new -i test-new.docx -t native)
...
@@ -1,8 +1,12 @@
 [Para [Str "Empty",Space,Str "para",Space,Str "["]
+,Para []
 ,Para [Str "]"]
 ,Para [Str "Para",Space,Str "with",Space,Str "space",Space,Str "then",Space,Str "newline",Space,Str "["]
+,Para []
 ,Para [Str "]"]
 ,Para [Str "Para",Space,Str "with",Space,Str "br",Space,Str "["]
+,Para [LineBreak]
 ,Para [Str "]"]
 ,Para [Str "Para",Space,Str "with",Space,Str "empty",Space,Str "em",Space,Str "["]
+,Para []
 ,Para [Str "]"]]

Notice that the change to the Docx writer means test.docx will now have a <w:p> between Empty para [...] and Para with space then newline [...], even though the whitespace inside the latter is stripped.

$ pandoc-old test.html -o test-old.docx
$ unzip test-old.docx -d test-old
...
$ unzip test-new.docx -d test-new
...
$ diff -U2 <(xmllint --format test-old/word/document.xml) <(xmllint --format test-new/word/document.xml)
...
@@ -14,4 +14,9 @@
         <w:pStyle w:val="BodyText"/>
       </w:pPr>
+    </w:p>
+    <w:p>
+      <w:pPr>
+        <w:pStyle w:val="BodyText"/>
+      </w:pPr>
       <w:r>
         <w:t xml:space="preserve">]</w:t>
@@ -30,4 +35,9 @@
         <w:pStyle w:val="BodyText"/>
       </w:pPr>
+    </w:p>
+    <w:p>
+      <w:pPr>
+        <w:pStyle w:val="BodyText"/>
+      </w:pPr>
       <w:r>
         <w:t xml:space="preserve">]</w:t>

See here for a side-by-side diff.

Thanks in advance for your time.

hftf on 3 Jul 2015

+++ Ophir Lifshitz [Jul 02 15 21:13 ]:

> I’ve looked at the source code to find the bug. (Sorry I haven’t posted
> a pull request; I’m new to Haskell.)
>
> I propose the following changes:
>
> Reader/Docx.hs, lines 484–486
>
>       if dropCap pPr
>         then do modify $ \s -> s { docxDropCap = ils' }
>                 return mempty
>         else do modify $ \s -> s { docxDropCap = mempty }
> -               return $ case isNull ils' of
> -                 True -> mempty
> -                 _ -> parStyleToTransform pPr $ para ils'
> +               return $ parStyleToTransform pPr $ para ils'

I'll bet this conditional is there for a reason. I wouldn't
want to remove it before checking to see what the reason is.
@jkr can tell us more, I'll bet.

jgm on 3 Jul 2015

Thanks for the response. I traced the conditional back to either 11b0778 or 293e4cf. If it ends up being important, then I think at the very least there should be a test that depends on it. Should I open a PR in the meantime?

hftf on 3 Jul 2015

+++ Ophir Lifshitz [Jul 03 15 12:23 ]:

> Thanks for the response. I traced the conditional back to 11b0778.
> If it ends up being important, then I think at the very least there
> should be a test that depends on it. Should I open a PR in the
> meantime?

You can if you like. But I do want to hear from @jkr who
wrote this code.

jgm on 3 Jul 2015

The issue here is that docx is written visually, so there are invisible elements _everywhere_. If we allowed empty paragraphs, numerous ordinary documents would have hundreds of them. It made the most sense to collapse them. I don't believe that there's any good way to know for sure whether you meant to do it, so practically I think this is a much safer route. Remember, we're trying to get content and structure, not visual formatting.

My reasons aren't just practical, though -- I admit to a certain philosophical (sorry John) bias here as well. I think that empty paragraphs are a way of using structural information for visual formatting, and I don't think that makes sense in pandoc. If you want empty space after a first line with a dropcap, define an appropriate docx style (or css, or LaTeX sty, or whatever).

I'm open to trying the change if @jgm doesn't agree with my reasoning. But personally I think it will open up a huge can of worms for dubious gains.

jkr on 4 Jul 2015

👍1

Just to clarify my reasoning a bit further -- consider someone writing a file with no indentation and spaces between the paragraphs. In other words, trying to write something that looks like the default pandoc docx output. 99% of the time, they'll hit return twice to break their paragraphs. Do we really want to take that to be an empty paragraph? I'd argue no -- that wasn't the intention, just as it wouldn't be in markdown or LaTeX if someone put three blank lines instead of two or one.

In writing the reader, I try to produce output that seems as close as possible to what most people mean (so we take block indents to mean block quote, because that's how, statistically speaking, everyone does it). My experience with files in the wild is that people very rarely mean "empty paragraph" when they hit the enter key a few times. They _might_ mean "vertical space" but that's a very different thing, and not something that I think pandoc should try to deal with, any more than it should deal with the size or color of fonts.

jkr on 4 Jul 2015

I think I agree with @jkr here, and I know that his opinion is based on experience with a lot of real-world Word files. Pandoc is about document structure, and empty paragraphs used simply for formatting (to increase space between paragraphs) are not really structural elements.

Further relevant fact:
Empty paragraphs don't make sense at all in some output formats (LaTeX, Markdown, RST, ...). Note that the LaTeX and Markdown writers will simply omit empty paragraphs. So, for these formats it doesn't really matter whether empty paragraphs are allowed in the AST. However, empty paragraphs do have an effect in HTML (among others). So, the proposed change would cause divergences in the appearance of documents between formats: the empty paragraphs would affect the look of the output in HTML but not in LaTeX.

jgm on 4 Jul 2015

FWIW: Mammoth, another Docx to HTML converter, added an equivalent feature back in May 2015.

hftf on 4 Jul 2015

I want to reopen this. It wouldn't be too hard to add an ignore_empty_paragraphs extension, which we could enable in selective readers (starting with docx). It could default to true, and there'd be no difference in the default behavior of pandoc.

@jkr does this seem like a reasonable plan? (See recent pandoc-discuss posts.)

jgm on 30 Nov 2017

Sure -- seems reasonable to me.

There's a good list of proposed changes in the post above. They were a bit scattered around, because sometimes empty paragraphs are introduced. But trying to implement the changes, and seeing where it does the wrong thing (without the ignore_empty_paragraphs extension enabled) seems like the way to go.

jkr on 2 Dec 2017

Actually, I'm leaning now towards just making it the default to include the empty paragraphs, and adding a --strip-empty-paragraphs command-line option that just runs a filter to remove them (for all input formats). This seems simpler. We could advise in the documentation that --strip-empty-paragraphs is often useful for cleaning up docx files where empty paragraphs are used for spacing.
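
As a rough illustration of what a strip filter of this kind does at the AST level, here is a minimal Python sketch that works on pandoc's JSON representation of a document. It is not pandoc's own implementation (which is written in Haskell); the function names `is_empty_para` and `strip_empty_paragraphs` and the sample document are hypothetical and exist only for this example.

```python
def is_empty_para(node):
    """True for the JSON form of an empty paragraph, i.e. Para []."""
    return (
        isinstance(node, dict)
        and node.get("t") == "Para"
        and node.get("c") == []
    )


def strip_empty_paragraphs(node):
    """Return a copy of a pandoc JSON tree without any Para [] blocks."""
    if isinstance(node, list):
        # Drop empty paragraphs from every list, then recurse into the rest.
        return [strip_empty_paragraphs(x) for x in node if not is_empty_para(x)]
    if isinstance(node, dict):
        return {k: strip_empty_paragraphs(v) for k, v in node.items()}
    return node


# Hypothetical sample document: foo, an empty paragraph, and a block quote
# that itself contains an empty paragraph followed by bar.
doc = {
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [{"t": "Str", "c": "foo"}]},
        {"t": "Para", "c": []},
        {"t": "BlockQuote", "c": [
            {"t": "Para", "c": []},
            {"t": "Para", "c": [{"t": "Str", "c": "bar"}]},
        ]},
    ],
}

stripped = strip_empty_paragraphs(doc)
```

To use something like this as an actual JSON filter, one would read the document with `json.load(sys.stdin)` and write the result with `json.dump(..., sys.stdout)`.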

Would you be opposed to this, @jkr?

Note that if you're converting to markdown or latex, the empty paragraphs are basically ignored anyway. So this would only affect e.g. converting docx to odt, docbook, or html.
@hftf

jgm on 2 Dec 2017

Sounds sensible to me.

jkr on 2 Dec 2017

Shouldn't --strip-empty-paragraphs be made the default when converting to HTML?

mb21 on 3 Dec 2017

+++ Mauro Bieg [Dec 03 17 12:08 ]:

> Shouldn't --strip-empty-paragraphs be made the default when converting
> to HTML?

I looked up whether p tags are allowed to be empty in HTML.

https://stackoverflow.com/questions/14848326/empty-paragraph-tags-should-i-allow-in-html-editor-or-not

It seems that there is a recommendation (but not a
requirement) to avoid empty p tags, and that they're often
simply ignored by browsers.

So, one option would be to change the HTML writer so that it
omits empty p tags, always. That seems most reasonable to
me, and it keeps the behavior of --strip-empty-paragraphs
regular and predictable.

Any thoughts?

jgm on 3 Dec 2017

As far as I understand, the --strip-empty-paragraphs option only affects the Reader → AST step. If the HTML writer always strips empty paragraphs anyway, it defeats the purpose of that option.

I think the current HTML Writer behavior is reasonable enough as is. The <p></p> elements are basically harmless since they don't ordinarily have an effect on rendering. Though, it could also be modified to "support" empty paragraphs (i.e. render them visible), as I wrote above:

> by default, empty paragraph margins are completely collapsed. But this can be easily fixed by transforming Para [] to <p>&nbsp;</p> or <p><br /></p>, or adding a CSS rule for p:empty.

This transformation could still be done easily by a user filter or by a CSS rule in the template, so it's not completely necessary to modify the HTML Writer.
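
A user filter of the kind mentioned here could look roughly like the following Python sketch, which rewrites every Para [] in pandoc's JSON AST into a paragraph holding a single non-breaking space so that it stays visible in HTML. The names `NBSP` and `fill_empty_paragraphs` are hypothetical, and the sketch assumes the usual JSON shape `{"t": "Para", "c": [...]}` for paragraphs.

```python
NBSP = "\u00a0"  # non-breaking space, written "\160" in pandoc's native format


def fill_empty_paragraphs(node):
    """Return a copy of a pandoc JSON tree where Para [] becomes Para [Str NBSP]."""
    if isinstance(node, list):
        return [fill_empty_paragraphs(x) for x in node]
    if isinstance(node, dict):
        if node.get("t") == "Para" and node.get("c") == []:
            return {"t": "Para", "c": [{"t": "Str", "c": NBSP}]}
        return {k: fill_empty_paragraphs(v) for k, v in node.items()}
    return node


# Hypothetical sample: foo, an empty paragraph, bar.
blocks = [
    {"t": "Para", "c": [{"t": "Str", "c": "foo"}]},
    {"t": "Para", "c": []},
    {"t": "Para", "c": [{"t": "Str", "c": "bar"}]},
]

filled = fill_empty_paragraphs(blocks)
```

Non-empty paragraphs pass through unchanged, so the filter only affects documents in which empty paragraphs were preserved by the reader.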

(NB: I believe Para [LineBreak], Para [Space], etc. are equivalent to Para [] anyway.)

@jgm FYI, if you'd like tests for d6c58eb, there's blank_paragraphs.docx in #2305.

hftf on 3 Dec 2017

> It seems that there is a recommendation (but not a requirement) to avoid empty p tags, and that they're often simply ignored by browsers.

That's true. However, depending on the CSS on the page, empty paragraphs will add unexpected vertical space. It's generally not good practice for a tool (like pandoc) to generate non-semantic empty paragraphs in HTML output.

> So, one option would be to change the HTML writer so that it
> omits empty p tags, always. That seems most reasonable to
> me, and it keeps the behavior of --strip-empty-paragraphs
> regular and predictable.

That sounds good to me.

@hftf If someone wants the empty paragraphs to be retained in HTML output, they should use Para [Str "\160"] (a non-breaking space). This could be done with a filter that converts every Para [] if necessary.

mb21 on 4 Dec 2017

> That's true. However, depending on the CSS on the page, empty paragraphs will add unexpected vertical space. It's generally not good practice for a tool (like pandoc) to generate non-semantic empty paragraphs in HTML output.

By not using --strip-empty-paragraphs, you are indicating that empty paragraphs _are_ semantic, so that any vertical space is expected.

I don't have a strong opinion on whether not stripping should be the default or whether it should be the other way around (strip by default unless --preserve-empty-paragraphs is set), for instance if stripping turns out to be the much more common use case. At least now the user has a choice.

hftf on 4 Dec 2017

@hftf I guess the default should be stripping on HTML/docbook output, and not stripping on docx/odt output...

mb21 on 4 Dec 2017

I think it should depend not on the output format, but on the kind of document – i.e. whether or not the document uses "semantic empty paragraphs," like for poetry – which is indicated by the new command line option. (It would be hard to detect automatically, anyway.)

hftf on 4 Dec 2017

The use case that motivated this was https://groups.google.com/forum/#!topic/pandoc-discuss/wlP6AL11NIY, i.e. the inability of Word to produce semantic input. Thus we need to preserve the empty paragraphs in the reader so that filters can handle them.

However, I think for the normal use case, going from docx to html should remove empty paragraphs, since Word docs often contain lots of empty lines for layout purposes (see also @jkr's comment above). Though I guess when going from docx to odt (or vice versa), it makes sense to keep the empty paragraphs. Thus, I think the default should depend on the output format. Or, equivalently, as suggested by jgm, the HTML writer should just drop empty paragraphs (and keep those with a &nbsp;).

mb21 on 4 Dec 2017

I think what you're really trying to argue for is --strip-empty-paragraphs being the default – which is fine with me. (In fact, 2½ years ago, I had suggested the reverse, --preserve-empty-paragraphs, as the solution to this issue in #2305.) If that's not correct, I'm sorry that I've failed to understand your idea. But these will be my last words on this new topic since it's already taken up a lot of time:

I won't complain if the HTML Writer does get modified to strip empty paragraphs. At that point, it would just be an annoyance to have to write a Para [] → Para [something] filter as a workaround. I'm happy enough that the new option has landed so that users are no longer completely prevented from converting a basic common type of document.

But I just don't see any sense architecturally in giving users the choice to not strip empty paragraphs, and then eliminating that choice by having the HTML Writer strip them out anyway. Unlike Markdown, it is indeed possible to represent Para [] in HTML, as <p></p>, and however meaningless that may seem, a user might want that. They have already exercised their choice by using the new option or not.

The Docx Reader used to make the choice for the user – which is exactly what this issue is about:

Reader strips empty paras,
  giving user no choice
          |
          V
       Reader        Writer
Input  ------>  AST  ------> Output

Ideally, rather, every Reader should be agnostic about empty paragraphs, and let the user choose at the AST level. This is finally where we are now:

           User chooses how          (via filters or via new option,
         to treat empty paras        which is like a built-in filter)
                 |
                 |
       Reader    V   Writer
Input  ------>  AST  ------> Output

I feel like your proposal would go back to the same kind of situation as before:

             Writer strips empty paras,
               giving user no choice
                       |
                       V
       Reader        Writer
Input  ------>  AST  ------> Output

If anything about this issue should affect Writers, it's that Writers in which it is either impossible to represent an empty paragraph (Markdown) or in which it renders as nothing by default (HTML) should be modified to support visually rendering an empty paragraph (Para []). This is because the user has already decided to preserve it, and so modifying the Writers would save them the effort of writing a filter. Those Writers could essentially treat Para [] as Para [LineBreak] or Para [Str "\160"] or whatever, which Readers already normalize to be equivalent.

Below I try to illustrate why it doesn't depend on the output format, but on the kind of document.

This is what happens visually (by default, or with your suggested change):

uses semantic empty paragraphs     meaningless empty paragraphs     
(AB, CD are like poem stanzas)     (A, B, C are ordinary paragraphs)
==============================     =================================
            preserve                         strip        preserve  
          ------------                     ---------    ------------

Input     Output                   Input   Output       Output      
Docx      ODT  HTML                Docx    ODT  HTML*   ODT* HTML   

A         A    A                   A       A    A       A    A      
B         B    B                           B    B            B      
               C                   B       C    C       B    C      
C         C    D                                                    
D         D                        C                    C           

visually inconsistent; must        preserve: visually inconsistent
use custom filter workaround       strip: inconsistent, but obeys user's choice
                                          to use semantic paragraphs only

This is what would happen if the Writers are modified as I describe:

uses semantic empty paragraphs     meaningless empty paragraphs     
(AB, CD are like poem stanzas)     (A, B, C are ordinary paragraphs)
==============================     =================================
            preserve                         strip        preserve  
          ------------                     ---------    ------------

Input     Output                   Input   Output       Output      
Docx      ODT  HTML                Docx    ODT  HTML*   ODT* HTML   

A         A    A                   A       A    A       A    A      
B         B    B                           B    B                   
                                   B       C    C       B    B      
C         C    C                                                    
D         D    D                   C                    C    C      

visually consistent                preserve: visually consistent
                                   strip: inconsistent, but obeys user's choice
                                          to use semantic paragraphs only

The * indicates your proposal, which is to use preserve for Docx/ODT by default and to use strip for HTML/Docbook by default. To me, this seems inconsistent both visually and structurally – why enforce semantic paragraphs in one format and not the other? People use a mix of semantic elements and visual layout elements in both Docx and HTML formats anyway.

hftf on 4 Dec 2017

+++ Mauro Bieg [Dec 04 17 14:30 ]:

> The use case that motivated this was
> https://groups.google.com/forum/#!topic/pandoc-discuss/wlP6AL11NIY,
> i.e. the inability of Word to produce semantic input. Thus we need to
> preserve the empty paragraphs in the reader so that filters can handle
> them.
>
> However, I think for the normal use case, going from docx to html should
> remove empty paragraphs, since Word docs often contain lots of empty
> lines for layout purposes. Though I guess when going from docx to
> odt (or vice versa), it makes sense to keep the empty paragraphs. Thus,
> I think the default should depend on the output format. Or, equivalently,
> as suggested by jgm, the HTML writer should just drop empty paragraphs
> (and keep those with a &nbsp;).

I think it's best to keep the option conceptually simple.
Having the empty paragraphs in HTML is usually not going to
affect how the document displays. People who don't want
them can use --strip-empty-paragraphs. People who want
whitespace to be inserted can use a filter to add a nbsp.

jgm on 4 Dec 2017

A couple notes:

- One reason I chose to make --strip-empty-paragraphs the
  option is that the stripping is implemented as a filter,
  between the readers and the writer. I'd rather not run the
  filter unless necessary, since most readers won't produce
  empty Para's anyway. If stripping were the default we'd
  be wasting some CPU cycles unnecessarily. Also, there's
  something to be said for making preserving the default.
  After all, Word has the concept of a paragraph (you can
  even click an option to see the little paragraph marks).
  The user added an empty paragraph. They may not know it,
  but they did! So the most predictable thing for pandoc to
  do is to produce an empty paragraph. If users don't like
  the result, they can use the option to strip these.
- I don't think it matters too much whether the HTML
  writer omits empty p tags. If we made this change, it would
  still be possible to write a filter if you wanted to keep
  these. You'd want to do that anyway, since the way browsers
  treat empty p tags may not be consistent. So I'm inclined
  to think this change would make sense. @hftf's main goal,
  the retention of semantic information about empty
  paragraphs, would still be achieved.

jgm on 4 Dec 2017

OK, so I've implemented a new approach.

--strip-empty-paragraphs is deprecated.

New extension: empty_paragraphs, disabled by default,
implemented for docx reader + writer, opendocument/odt writer,
html reader + writer.

I think this is a better approach. It restores the default
behavior of the docx reader to the way it was before 2.0.4
(no empty paragraphs).

jgm on 5 Dec 2017

Sounds good. I'm curious what changes were made to the writers, but I'll just wait until you push it to try it out.

hftf on 5 Dec 2017

It has been pushed.

The affected writers omit empty paragraphs unless the
empty_paragraphs extension is enabled.
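
The writer behavior described above can be pictured with a toy HTML writer in Python. This is only a sketch under stated assumptions: `write_html`, `inline_to_html`, and the boolean parameter `empty_paragraphs` are hypothetical stand-ins for the real Haskell writer and its extension check, and only Para blocks with Str, Space, and LineBreak inlines are handled.

```python
from html import escape


def inline_to_html(inline):
    """Render one inline element of pandoc's JSON AST (tiny subset only)."""
    kind = inline["t"]
    if kind == "Str":
        return escape(inline["c"])
    if kind == "Space":
        return " "
    if kind == "LineBreak":
        return "<br />"
    raise ValueError("unsupported inline: " + kind)


def write_html(blocks, empty_paragraphs=False):
    """Render Para blocks, omitting Para [] unless empty_paragraphs is set."""
    out = []
    for block in blocks:
        if block["t"] != "Para":
            raise ValueError("unsupported block: " + block["t"])
        if not block["c"] and not empty_paragraphs:
            continue  # default behavior: the empty paragraph is omitted
        out.append("<p>" + "".join(inline_to_html(i) for i in block["c"]) + "</p>")
    return "\n".join(out)


# Hypothetical sample: foo, an empty paragraph, bar.
blocks = [
    {"t": "Para", "c": [{"t": "Str", "c": "foo"}]},
    {"t": "Para", "c": []},
    {"t": "Para", "c": [{"t": "Str", "c": "bar"}]},
]

default_html = write_html(blocks)
preserved_html = write_html(blocks, empty_paragraphs=True)
```

With the flag off, the output has two paragraphs; with it on, the empty `<p></p>` is kept between them.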

+++ hftf [Dec 04 17 15:19 ]:

> Sounds good. I'm curious what changes were made to the writers, but
> I'll just wait until you push it to try it out.
