Pandoc: DOCX Template Seems Corrupt?

Created on 11 Feb 2012 · 117 Comments · Source: jgm/pandoc

I'm using Pandoc 1.9.1 on Windows with Word 2007 and I'm running into some problems. I think the cause is the reference.docx file.

If I use the stock reference.docx file it produces a valid word document without error. If I touch the reference.docx file (change a style, resave, whatever) any documents created with that will give the error:

"The file FILENAME cannot be opened because there are problems with the contents."

If I choose the "Recover" option I see an error "Styles 1".

Looking at the styles.xml inside the reference docx file it seems like there may be some issues. The "Date" styleid has a name set to "Authors".

bug

All 117 comments

I fixed the problem with the Date style, but that didn't seem to be the real problem. I can reproduce the issue on my Mac.

Yeah I tried that too and same result. How was the original DOCX created? By hand or through Word?

I'm thinking this might be relevant:
http://idippedut.dk/post/2010/04/22/Correct-according-to-spec-or-implementation.aspx

I tried modifying reference.docx in Word, and saving as newref.docx. I then created a document r2.docx using

pandoc -o r2.docx --reference-docx=newref.docx

Tried to open this in Word, got the error, quit Word without recovering. Then I edited r2.docx in emacs, went into the file _rels/.rels, and changed 'http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties' to 'http://schemas.openxmlformats.org/officedocument/2006/relationships/metadata/core-properties'. After that I was able to open the file in Word without problems.

This (plus the linked page) suggests to me that the problem is with Word, not pandoc. Word is actually breaking _rels/.rels when you save the new reference docx. If this is right, then I'm not sure there's anything to fix in pandoc. But I'm not sure.

It seems that Word is also creating another file for the metadata: originally there is docProps/core.xml, but after saving with Word you also have docProps/core0.xml. I think this must be related to the problem above.
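One way to compare what Word changed, rather than hand-editing in emacs, is to list the package-level relationships of both files. The following is a minimal sketch, not part of pandoc: it assumes the relationships part lives at `_rels/.rels` inside the docx archive, and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by every relationships part in an OPC package.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def relationship_types(rels_xml):
    """Return {target: type} for each Relationship element in a .rels part."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Target"): rel.get("Type")
        for rel in root.iter("{%s}Relationship" % RELS_NS)
    }


def package_relationships(docx_path):
    """Read _rels/.rels from a docx (a zip archive) and list its relationships."""
    with zipfile.ZipFile(docx_path) as archive:
        return relationship_types(archive.read("_rels/.rels"))
```

Calling `package_relationships("reference.docx")` and `package_relationships("newref.docx")` and comparing the two dictionaries should show whether the core-properties relationship type or the `docProps/core.xml` target differs between the stock file and the one resaved by Word.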

Tried with the updated reference.docx. The problem is still there. If I change it at all I get the same error.

Here is a template with just the title color changed:

http://f.nddn.net/p/zpagp87zvp/reference.docx

And here is the resulting Word document:

http://f.nddn.net/p/xsk9hvbr3p/README.docx

Yes, that last change didn't fix it for me either. Did you try hand-editing _rels/.rels as described above?

It seems to me that, strange as it seems, Word is at fault here. Unless the validator at http://www.probatron.org:8080/officeotron/officeotron.html is wrong.

I suppose I could have pandoc undo the change Word makes to _rels/.rels.

Let me know if this fix doesn't work for you.

Success. Thank you!

I got really similar problems with Pandoc 1.9.2 on Arch Linux when I modify the reference.docx and add an image to the document.

For example the following simple pandoc file fails to load without error in Word 2007:

![caption](img.jpg)

I am not sure if this is related.

So, documents without images work okay with this reference.docx?

Thanks for prompt reply!

Absolutely: I can add text, heading, tables etc. without any error.

I've got an MS Office 2010 Trial since my last message here, so I am updating this issue.

I get the same error message while trying to load a generated docx file based on a barely modified reference.docx (all I did was add a space after "Hello World!", delete that space, and save the file): "The file is corrupt and cannot be opened".

If I click "Ok", then the docx is repaired and everything works like a charm.

Minimal example for the issue:

I was having this same error message and workflow, and then MS Word 2011 failed to show a footnote.

What version of pandoc?

oh sorry. 1.10.1

@singingfish: It would be helpful if you could do what daroczig does above, and provide a link to a reference docx and markdown file that are sufficient to reproduce the problem.

sure. Here you go. http://www.4shared.com/folder/KQ6HWfXr/_online.html Test file is test.txt, build script is build.sh, template is master_document.docx (based on a previous pandoc run) and output file demonstrating the problem is test.docx

I am having a similar issue. When I try to open the generated Word doc in 2007 (Windows) it tells me that "one or more of the footnotes in the document are missing or corrupt" (the markdown file does not have any footnotes in it). When I open the recovered file Word says it repaired "Style 1" at the beginning of the document.

I am using 1.10.1 on Mac (Mountain Lion). The command I use to generate the file is:

pandoc --reference-docx=/Users/klaus/Dropbox/Elements/Templates/reference.docx -f markdown+pipe_tables -t docx "pandoc-test.md" -o "pandoc-test.docx"

The original markdown text file, the reference.docx template, and the generated docx can be found here.

https://www.dropbox.com/s/44s7p3unul9y62v/pandoc-test.zip

I can confirm this on my Mac. If I edit a copy of reference.docx with Word 2011, then save the result and use it as a reference-docx with pandoc, I get a corrupted file. This was using pandoc's README as input, which has footnotes. When I tried again with a source file without footnotes, I did not get a corrupted file.

I noticed that in the modified reference.docx saved by Word, the file word/styles.xml contains

w:styleId="FootnoteReference1">
<w:name w:val="Footnote Reference1" />

where the original reference.docx contains

w:styleId="FootnoteReference">
<w:name w:val="Footnote Reference" />

This change could account for the corruption. I have no idea why Word changes the style ID from FootnoteReference to FootnoteReference1, but possibly it's because Word already has a default style with id FootnoteReference.

I tried changing the ID in the modified reference.docx, but it didn't solve the problem. There may be other problems of this sort.

Similar problem with the Hyperlink style, which becomes Hyperlink1. But fixing this didn't solve the problem, so there is still something else...

As observed in the thread above, the problem does seem to have something to do with footnotes, as it disappears when I remove all footnotes from the input file (pandoc README). A file containing a single image is also sufficient to trigger the problem. So that narrows things down: we need to look at footnotes and images.

I think I've seen problems with table and/or figure captions as well.

I've seen this problem for a while with either a table or a footnote in
the generated docx file. I think this problem might be related to the
version of the Word XML format. Evidence (tested under Win7 64-bit,
Word 2012):

  • When pandoc is given a template docx file created with Word 2012, it
    fails to recognize it as a valid docx file.
  • Unzip and re-zip the reference.docx, then open it in Word
    2012: an error message pops up. The same actions on other docx files
    edited with Word 2012 don't produce the error message.

Best,

Chen, Huashan

@singingfish - Tables work fine in all my tests. But an image with caption would have the same problem as an image.

@huashan - "pandoc will fail to recognize it as a valid docx file": What error message, exactly, do you get from pandoc? And what exactly are you doing -- modifying the standard reference.docx with Word 2012? Have you tried checking the "compatibility mode" box when you save, and does that make any difference?

"unzip and re-zip" - I have a really hard time seeing how this could make a difference. Are you sure you are rezipping with the same directory structure? What commands are you using, exactly, to do this test? I tried it with Word 2011 and had no problems. Does the unmodified reference.docx open all right in Word 2012?

Moving forward with this issue: The way to debug this is by using tiny files that reproduce the problem, the first with just one footnote, the second with just one image. By examining the difference between the "rescued" docx and the original, we may be able to figure out what is going on.

Bigger picture: Given that Word seems to make arbitrary changes in the document when resaving it, I wonder whether it would be better to scrap the "reference.docx" idea entirely and instead allow a "styles.xml" file to be specified on the command line. The disadvantage of this, of course, is that it would be much harder for people to modify the styles. But at least it would work!

OK, I've isolated the problem with images. The original reference.docx contains lines in [Content_Types].xml to set default mime types for images:

  <Default ContentType="application/xml" Extension="xml"/>
  <Default ContentType="application/pdf" Extension="pdf"/>
  <Default ContentType="image/x-emf" Extension="emf"/>
  <Default ContentType="image/png" Extension="png"/>
  <Default ContentType="image/jpeg" Extension="jpeg"/>
  <Default ContentType="image/gif" Extension="gif"/>

When you edit the reference.docx in Word and save it again, it overwrites the old [Content_Types].xml with a new version that lacks these lines. If you reinsert these lines, the document will open in Word.

This problem could be fixed by having pandoc overwrite the [Content_Types].xml file in the reference.docx. (Though this would take away some flexibility in specifying one's own content types.)
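To check whether a given reference.docx has lost these defaults, something like the following could be used. It is only a sketch under stated assumptions: the set of image extensions is taken from the lines quoted above, the helper names are hypothetical, and this is not how pandoc itself handles the file.

```python
import xml.etree.ElementTree as ET
import zipfile

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# Image extensions from the lines quoted above; assumed to be the set worth checking.
IMAGE_EXTENSIONS = {"emf", "png", "jpeg", "gif"}


def default_extensions(content_types_xml):
    """Return the lower-cased extensions that have a <Default> entry."""
    root = ET.fromstring(content_types_xml)
    return {
        (default.get("Extension") or "").lower()
        for default in root.iter("{%s}Default" % CT_NS)
    }


def missing_image_defaults(docx_path):
    """List image extensions with no <Default> entry in [Content_Types].xml."""
    with zipfile.ZipFile(docx_path) as archive:
        present = default_extensions(archive.read("[Content_Types].xml"))
    return sorted(IMAGE_EXTENSIONS - present)
```

An empty list from `missing_image_defaults("reference.docx")` would mean all four image defaults are still declared; any extension it returns is one that Word dropped when resaving.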

With footnotes, at least part of the problem seems to be a missing reference to footnotes.xml in word/_rels/document.xml.rels. Again, this reference is present in the reference docx but gets overwritten by Word. Perhaps, again, we could solve the problem by having pandoc overwrite this file. However, I haven't been able to fix things just by editing that file, so perhaps there is another problem.

Aha - I figured out the other piece of the footnotes problem: the entry for footnotes.xml in [Content_Types].xml was also being dropped. When I put this in, it worked:

  <Override PartName="/word/footnotes.xml"
  ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml" />
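The two footnote-related pieces described above (the relationship in word/_rels/document.xml.rels and the Override in [Content_Types].xml) can be checked together. This is a rough diagnostic sketch, not pandoc's fix; the function names are invented here, and it assumes the relationship target is written as the relative path `footnotes.xml`.

```python
import xml.etree.ElementTree as ET
import zipfile

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def has_override(content_types_xml, part_name):
    """True if [Content_Types].xml has an <Override> for part_name."""
    root = ET.fromstring(content_types_xml)
    return any(
        override.get("PartName") == part_name
        for override in root.iter("{%s}Override" % CT_NS)
    )


def has_relationship_target(rels_xml, target):
    """True if a .rels part contains a Relationship pointing at target."""
    root = ET.fromstring(rels_xml)
    return any(
        rel.get("Target") == target
        for rel in root.iter("{%s}Relationship" % RELS_NS)
    )


def footnote_report(docx_path):
    """Report which footnote-related pieces are present in a docx."""
    with zipfile.ZipFile(docx_path) as archive:
        names = set(archive.namelist())
        return {
            "part": "word/footnotes.xml" in names,
            "override": has_override(
                archive.read("[Content_Types].xml"), "/word/footnotes.xml"
            ),
            # Assumes the target is relative to the word/ directory.
            "relationship": has_relationship_target(
                archive.read("word/_rels/document.xml.rels"), "footnotes.xml"
            ),
        }
```

If `footnote_report` returns `False` for "override" or "relationship" on a generated file whose input had footnotes, that would match the corruption described in this comment.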

Everything except the issue with style names/IDs has been fixed.

I'm seeing what appears to be a possibly related issue to this (or perhaps to issue 780?): when I generate a docx via Pandoc (v. 1.11.1, though I've seen it for a long while) with footnotes, the first generate action is fine. If I make changes to that document, save it, and attempt to use it as a template, I get the following message from Word:

The Open XML file test.docx cannot be opened because there are problems with the contents or the file name might contain invalid characters (for example, /).

Details

One or more footnotes referenced in the document are missing or corrupt.

Next, Word displays another prompt:

Word found unreadable content in test.docx. Do you want to recover the contents of this document? If you trust the source of this document, click Yes.

If I click Yes, Word _opens_ the file just fine, albeit in a new, unnamed document. Interestingly enough, it actually gets the template information correct, as near as I can tell. It just refuses to open the original, then correctly imports its content. I can then save the new document, and everything works as expected from that point forward. The initial conversion with a reference seems to be including _something_ that Word 2011 doesn't like, though I'm not sure what.

I'm using Word 2011 on OS X 10.8.4. The command I am running is pandoc test.md -o test.docx. This happens with any document where I try to use footnotes with a reference document. I can happily supply test files which demonstrate the issue (at least on my setup) if that's helpful.

_(Edited to add some details.)_

@chriskrycho - I can reproduce this with pandoc 1.11.1:

% /usr/local/bin/pandoc README -o r11.docx
% open r11.docx # edit this and save in Word
% /usr/local/bin/pandoc --reference-docx r11.docx README -o r12.docx
% open r12.docx  # produces error message about corrupted file

However, I believe the problem has been solved in the development version. At least, I could not reproduce it there with the same steps. Commit bd1079e48e055f7b58ce13be3dfa8b5c5cb5ba7c was designed to limit the damage Word could do by reorganizing docx files.

Excellent. I'll look forward to that in the next release! (I'm quite capable of but disinclined to spend the time getting Haskell up and building the development version myself.)

@jgm I appear to be having the same problem @chriskrycho had, using Pandoc 1.12.1 on a Mac with Microsoft Word for Mac 2011 Version 14.4.1. Getting exactly the same error message and behavior he got, with the same steps to reproduce the problem.

Why don't you try with the latest pandoc and see if the problem has
been fixed?

Upgraded to Pandoc 1.12.4.2, but still having the same problem.

@wcaleb, I have the same version of Word for Mac as you do, and I cannot reproduce the problem with this sequence:

% pandoc README -o r11.docx
% open r11.docx # edit this and save in Word
% pandoc --reference-docx r11.docx README -o r12.docx
% open r12.docx  # produces error message about corrupted file

r12.docx seems fine and Word has no complaints. Did you do something different? What edits did you do to the word document? Perhaps you could send/attach/link to a version that causes the problem.

For what it's worth, I just did it with a semi-random file I had available and it failed. It seems to crop up any time I have footnotes in my Markdown document.

Here's an extremely simple sample file with which I just tested it:

This is a trivial Markdown document. It has one footnote.[^1]

[^1]: Yep. Just the one.

And the calls I made:

$ pandoc test.md -o t1.docx
$ open t1.docx
# I then applied Quick Style settings to the document, saved it, and closed it.
$ pandoc test.md -o t2.docx --reference-docx=t1.docx
$ open t2.docx  # Word fails to open the document, with the same error message

This is running OS X 10.9.3, Word 14.4.3 (140616), pandoc 1.12.4.2 (installed it just before I ran this test again, but I know it also was broken in 1.12.3).

Word can recover the document, and all its contents are intact. It just can't open it directly.

Edit: sorry I didn't report that this was still an issue sooner! I had noted it was not fixed in the 1.12 release, and had meant to get back to you about it, but it slipped my mind.

@jgm I did the exact sequence you describe, using the latest Pandoc README from Github as input. I even expanded the full path of the Pandoc executable to make sure there was no weirdness there.

There are no errors if you don't make any changes to r11.docx before using it as a reference file.

But I opened r11.docx and inserted an extra carriage return (not even touching the Styles), saved the file in Word, and then attempted to use r11.docx as a reference file to produce r12.docx. This raises the "One or more footnotes referenced in the document are missing or corrupt" error when I try to open r12.docx. (Though it is still able to get the contents of the new file, just as @chriskrycho describes.)

Perhaps it's relevant that I'm running Mac 10.7.5.

If you'd like me to send you my initial test and reference files, I can.

I did my testing just now with the development version of pandoc. There have been quite a few docx-related changes since the 1.12.4.2 release. I was not able to reproduce the problems you describe with this version of pandoc. If you're able to compile the current development version from source, I'd like to know whether you still have the problem. (Also, be sure you're not inadvertently running an older version of pandoc which is in your path, and check ~/.pandoc to make sure you don't have a custom reference.docx there.)

@jgm Jackpot. Development version works.

Oh good.

@jgm @jkr This has also given me an excuse to play around with the new DOCX reader, which looks _fantastic_. I've tried it on a bunch of test files already and it hasn't hiccuped yet!

@wcaleb -- thanks! And when you do find the inevitable hiccup, please let me know!

Using pandoc version 1.13.1, windows 7 and office 2013 I get an error similar to those in this thread.
Example files here:
https://www.dropbox.com/l/JhjvUUTtXnJMg0S4JosYtt

The cmds.txt file contains the commands I have used for each of the test scenarios and the result of these test.

This may very well be an error on my part, as I am not an IT professional, but I am stuck and can't find the solution by searching previous issues.

I am having general problems getting heading styles in the reference doc applied to headers in the markdown file, but I think that may be me using the wrong styles in Word when creating the reference doc. The main issue is that when an image is included in the .md file and a footer in the reference docx, the resulting file is "damaged" (as Word puts it).

Confirmed on HEAD. Reopening.

The header problem, at least, is because your reference docx uses "OverskriftN" but defines nothing for "HeadingN" in the styles file. I'm not sure what the right approach for localized versions of Word should be.

The solution there seems to be that, regardless of the character, the style name (which is different from the style id) is "heading 1". So we'd have to parse the style file (I already do this in the docx reader), see what the id is with a corresponding "heading n" child name tag, and use that. Doable.
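The lookup described here (find the style ID whose name child is "heading n") might look roughly like this. It is an illustrative Python sketch, not the Haskell code in pandoc's docx reader; the function name is hypothetical, and the premise that the name stays "heading 1" regardless of language comes from the comment above.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def style_ids_by_name(styles_xml):
    """Map each lower-cased style name in styles.xml to its w:styleId."""
    root = ET.fromstring(styles_xml)
    mapping = {}
    for style in root.iter("{%s}style" % W_NS):
        name = style.find("{%s}name" % W_NS)
        if name is None:
            continue
        value = name.get("{%s}val" % W_NS)
        if value is not None:
            mapping[value.lower()] = style.get("{%s}styleId" % W_NS)
    return mapping
```

With a reference docx like the one discussed, `style_ids_by_name(styles_xml).get("heading 1")` would be expected to yield an ID such as "Overskrift1" instead of "Heading1", and the writer could then emit that ID.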

The corruption problem looks like it's because the header and footer and relationship id numbers aren't properly renumbered, so, in your noimg example, when it looks for a footer, it actually gets a theme(!). This is a separate issue from the language one above. I'll work on both separately.

@SandMann1 , are you able to build a development version? I think I've fixed the corruption problem, and tests with your files seem to work. It's available on https://github.com/jkr/pandoc/tree/renumberHeaderFooters

The headers problem is a different issue, having to do with class names in different languages. I'll be opening a new issue to deal with that.

I have colleagues able to do so. Will ask for their help to test on monday. Thank you!

edit:
Tried to build development version using this guide:
https://github.com/jgm/pandoc/wiki/Installing-the-development-version-of-pandoc#running-pandoc

While running "cabal install --force --enable-tests" I get an ExitFailure:
failedpandoc build

Got it to build after reinstalling Haskell platform. But the built pandoc.exe errors with the message: "pandoc.exe: Could not find datafile reference.docx" when using the same command as stated in examplefiles: "pandoc noimg_test.md -f markdown -t docx -S -o f_noimg_test.docx --reference-docx=ref_f.docx"

Hmmm... that's a different problem, about the templates installing -- something about building and installing on windows, not related to this fix. I'm not sure the best way to fix the build, but can you try running it with

pandoc --data-dir=path/to/data (where data is the directory in the pandoc tree containing the default "reference.docx")

If that still produces the same problem, I'll probably just push this, since it seems to work for me.

I'm sorry. Same problem. Probably something I did wrong during the build process. This is far beyond the limits of my abilities, so I have to admit defeat.

Looking forward to future releases of pandoc. :)

I recommend using

cabal install -fembed_data_files

This will embed the data files in the binary, eliminating one
source of potential problems.

Well, I've been playing with it in a Mac, and it definitely seems to solve the corruption problem. The headers problem, as I've mentioned, is a lang/style issue (what do we do when the header styles in the reference.docx have different names, and HeadingN isn't there at all?). But that is a different issue, to be discussed in #1607.

So, closing with 020a527c15121087b057c7d3054467e3a499a94d

Hi all,

I've been trying to create DOCX and ODT files using pandoc 1.13.2.1 on my linux (xubuntu) machine, and I'm encountering the same problem described here. Both Word and LibreOffice say the file is corrupt if I change the styles in the reference doc. Sadly, in my case some of the formatting is broken after I let Word or LibreOffice "fix" the corruption. This may have something to do with the fact that I use RTL styling (I write in Hebrew).

Sadly, I'm not tech savvy enough to understand the solutions suggested earlier. I can compile from source if I have a step by step explanation, I guess, though I'm not sure if the developer version mentioned earlier is still relevant, or if it's already implemented in 1.13.2.1.

I really hope there's a solution for this, because I'm really looking for a way to easily produce MSWord compatible documents on linux and have yet to find another way.

@eladhen, please consider opening a new issue for this.

Also could you share input files (Markdown and modified reference.docx), corrupted output file, and exact command you use to produce output? You can send those to my e-mail, if you wish (see github profile). If any of those are confidential, try to produce a simple mock version in order to demonstrate this problem.

Thanks.

Well, it seems that if my reference file doesn't include a footnote I don't get the "corrupt file" error from Word. This is still a bug I guess, but I can live with it.

The right-to-left formatting bug, however, remains. Which means, I guess, that it's a completely different bug, and I'll open a new bug the minute I understand how to do it. As for the problem with LibreOffice, it was my bad. I wrote:


pandoc test02.md -f markdown -t Docx -s -o test02.odt

so I guess it formatted it as a docx file and that was the source of the error.

Sorry for the bother.

@eladhen, I'd still like to test your files against current development version of Pandoc, if that's not too much to ask. If there are problems with footnotes, I'd prefer to know about those, and maybe fix them if possible.

As for opening new issue, there's a green button near the top of this page, titled "New Issue".

@eladhen,
@lierdakil is more familiar with this, but I'm wondering how the reference.docx you're modifying was created; if it was not generated by Pandoc, could you test again (if you have time):

  1. Generate a simple docx using Pandoc.
  2. Change the styles etc. (including RTL and footnotes) as needed in this docx.
  3. Use this as the reference.docx and create a test.docx.
  4. See whether this test.docx opens fine in Word and LibreOffice.

Please provide the command lines and resulting docx if possible.

Hmm... can I upload files to this platform (sorry for the noobness)?

(and yes, I edited a reference.docx generated by pandoc).

@eladhen Thanks. Many use dropbox.com or similar sites to upload and then share the link here.

@eladhen You could give this method a try too, since you already have an account on GitHub.
http://hanxue-it.blogspot.com/2014/06/how-to-upload-image-binary-file-to-gist.html

@eladhen, only images are supported in issues. So no, you'll have to find another way. That's why I initially suggested e-mail. Dropbox/Google Drive/Microsoft OneDrive would work as well.

Here goes:

I wrote this file using gedit. Notice that I put a footnote there. Without one in the reference.docx this problem doesn't crop up:
https://www.dropbox.com/s/f5z8q8dl0jypvh9/original.md?dl=0

I converted it with this command:
pandoc original.md -f markdown -t Docx -s -o first.docx

This is the output I got (which Word opens without a problem):
https://www.dropbox.com/s/g9s0nbb4w9b5j0f/first.docx?dl=0

I opened it on my windows machine with Word 2010, restyled it, and saved it as a reference.docx:
https://www.dropbox.com/s/ana7oqk0jqng6i9/reference.docx?dl=0

I put the reference in /.pandoc on my Home directory.

Then I wrote this new simple file in gedit:
https://www.dropbox.com/s/doc6otu1hggs2yo/second.md?dl=0

I converted it to docx with this command:
pandoc second.md -f markdown -t Docx -s -o second.docx

This is the output file:
https://www.dropbox.com/s/ti9ioi3y5j2q96c/second.docx?dl=0

When I open it in word I get this error:
https://www.dropbox.com/s/xra29t0067h104g/error.png?dl=0

I'm sorry it's in Hebrew. I'm not sure how to change this. It says something along the line of:

"couldn't open the file second.docx because there are problems with its contents.
Details: one or more footnote in the document is missing or corrupted."

When I let word try and open it anyway it is opened but some of my new styling is lost (font and color changes I made).

As I said earlier, there are also some problems with the Hebrew formatting, but these crop up regardless of if I have a footnote in my reference file, so I'm pretty sure it's an unrelated issue.

@eladhen Thank you - I'll get back to you later today.

@eladhen Could you please test again with the following on your system? There's a slight difference when reference.docx is specified this way:

pandoc second.md --reference-docx=reference.docx -o third.docx

It opens fine (on Word 2011 on Mac) when I do this, but I cannot confirm whether it has the styles/properties you need.

Now that I look at it, it seems very much like a bug @jkr fixed a while back with https://github.com/jgm/pandoc/commit/ba59e5447ffe8d15dd7fee69a7aa03706ce3c49b. I will make sure to check for possible regressions though.

I copied reference.docx to my working directory and wrote:

pandoc second.md --reference-docx=reference.docx -o third.docx

And got this file:
https://www.dropbox.com/s/5g3qvpow75mbetw/third.docx?dl=0

It's giving me the same error message.

@eladhen Could you please delete or rename the reference.docx in your .pandoc first and then test it?

Hmm, this seems to have worked!

https://www.dropbox.com/s/5g3qvpow75mbetw/third.docx?dl=0

This file doesn't give an error message. Still having problems with Hebrew, but I'll open a new bug report for that as I don't think it's related.

@eladhen Thanks for testing. I'll look into this (and the new one) later today.

@eladhen If it is not too much trouble, could you please upload a typical Hebrew Word document with styles and settings you'd like to have? It'll help me with troubleshooting. Please clean the document of any personal/confidential info before uploading (see https://technet.microsoft.com/en-us/magazine/ff936144.aspx).

It's not too much trouble, but I think it'll be a bit difficult to understand the very specific formatting issues I see without me pointing them out. I'll try to open a report about it tomorrow and I'll include a sample. Thank you for the help.

@eladhen You're most welcome.
You may already be familiar with screen capture tools like these - they could be handy in creating your report:
https://www.techsmith.com/jing.html
http://www.cockos.com/licecap/

I'm having this problem. Running pandoc 1.16.0.2 on OSX 10.11.3.

  1. create a simple docx file with pandoc (pandoc bananas.txt -f markdown -o bananas.docx)
  2. moved bananas.docx to ~/.pandoc and renamed it to reference.docx.
  3. created another copy of bananas.docx and it opens without issue.
  4. edit reference.docx in Word 15.18 and get "The Open XML file bananas.docx cannot be opened because there are problems with the contents or the filename might contain invalid characters (for example, /). Details: Microsoft Office cannot open this file because some parts are missing or invalid"

https://dl.dropboxusercontent.com/u/765401/pandoc-files.zip

Can you upload your reference.docx? Or the source for it
(bananas.txt)?

They should all be in the zip file on Dropbox I linked to. bananas.txt is the file I used to create the initial .docx file that I used to create reference.docx (which is also in the zip file)

@fstorr I'm able to download the zip file by copying and pasting the address - clicking on the link fails, as for some reason the link points to https://github.com/jgm/pandoc/issues/text%20file,%20original%20bananas.docx,%20and%20edited%20reference.docx%20files.

I used markdown syntax to create the link but for some reason GitHub didn't like it. I've edited the link so it's just a link. Refreshing the page should fix it.

@fstorr The reference.docx in the zip opens fine on my Mac Word - should it have failed?

@nkalvi - I believe the problem arose from trying to open
a docx created using that reference.docx -- not the
reference.docx itself.

@jgm - thanks. Now I see that the resulting docx gives the error mentioned.

@jgm I tried validation check from OOXML SDK 2.5 on the resulting docx. There were 17 errors, but the relevant ones seem to be related to endnotes in settings.xml:

    <w:endnotePr>
        <w:endnote w:id="-1" />
        <w:endnote w:id="0" />
    </w:endnotePr>

endnotes.xml is present in reference.docx but not in the one based on it.
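
A quick way to check a generated file for this kind of mismatch is to look for a note-properties element in word/settings.xml whose matching part is absent from the package. The sketch below is only a diagnostic under stated assumptions: the function name `dangling_note_parts` is hypothetical, it assumes the usual `w:` namespace prefix, and it uses a plain substring search rather than real XML parsing.

```python
import zipfile


def dangling_note_parts(docx):
    """Return note parts that settings.xml mentions but the package lacks.

    `docx` may be a path or a file-like object.
    Assumption: elements use the conventional "w:" prefix.
    """
    with zipfile.ZipFile(docx) as z:
        names = set(z.namelist())
        settings = z.read("word/settings.xml").decode("utf-8")

    missing = []
    for tag, part in (("w:endnotePr", "word/endnotes.xml"),
                      ("w:footnotePr", "word/footnotes.xml")):
        # The settings element is present but the part it refers to is not.
        if "<" + tag in settings and part not in names:
            missing.append(part)
    return missing
```

Calling `dangling_note_parts("output.docx")` on a file produced from a Word-saved reference (the file name here is just a placeholder) would list the parts to investigate; an empty list means this particular mismatch is not present.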

On the off chance this is relevant: I installed pandoc using the homebrew package management system.

@jgm Resulting docx opens fine if I make the following changes:

  1. Include endnotes.xml
  2. Copy the endnotes Override entry in [Content_Types].xml from original.
  3. Create appropriate entry for endnotes in word\_rels\document.xml.rels

I hope this helps narrow down the issue.
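
The three manual changes listed above could be scripted roughly as follows. This is a sketch, not pandoc's fix: `add_endnotes_part` and the relationship id `rIdEndnotes` are hypothetical names, the Override entry is constructed rather than copied from the original, and the XML is patched by plain byte insertion, which assumes the closing tags are written as `</Types>` and `</Relationships>` with no namespace prefix. It also assumes the broken file has no endnotes part yet.

```python
import zipfile

ENDNOTES_PART = "word/endnotes.xml"
CONTENT_TYPE = ("application/vnd.openxmlformats-officedocument."
                "wordprocessingml.endnotes+xml")
REL_TYPE = ("http://schemas.openxmlformats.org/officeDocument/2006/"
            "relationships/endnotes")


def _insert_before(xml_bytes, closing_tag, fragment):
    """Insert fragment just before the last occurrence of closing_tag."""
    head, sep, tail = xml_bytes.rpartition(closing_tag)
    if not sep:
        raise ValueError("closing tag %r not found" % closing_tag)
    return head + fragment + sep + tail


def add_endnotes_part(reference, broken, fixed, rel_id="rIdEndnotes"):
    """Write a copy of `broken` to `fixed` with the endnotes part added."""
    override = ('<Override PartName="/%s" ContentType="%s"/>'
                % (ENDNOTES_PART, CONTENT_TYPE)).encode("utf-8")
    relationship = ('<Relationship Id="%s" Type="%s" Target="endnotes.xml"/>'
                    % (rel_id, REL_TYPE)).encode("utf-8")

    # Step 1: take endnotes.xml from the reference document.
    with zipfile.ZipFile(reference) as ref:
        endnotes = ref.read(ENDNOTES_PART)

    with zipfile.ZipFile(broken) as src, \
            zipfile.ZipFile(fixed, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "[Content_Types].xml":
                # Step 2: declare the content type of the new part.
                data = _insert_before(data, b"</Types>", override)
            elif item.filename == "word/_rels/document.xml.rels":
                # Step 3: relate the main document to the new part.
                data = _insert_before(data, b"</Relationships>", relationship)
            dst.writestr(item, data)
        dst.writestr(ENDNOTES_PART, endnotes)
```

All other parts are copied unchanged, so the result differs from the input only in the three places named in the list.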

@fstorr I get the same error, installing 'directly' from Pandoc's OS X package.

@nkalvi @jgm: I encountered the same problem with footnotes, which I tried to fix with commit ba59e544 and related commits. That might help.

Thanks @jkr. Would you be able to add a similar correction for when settings.xml includes a reference to endnotes? (I haven't learnt Haskell yet.)

Pinging @lierdakil -- I vaguely remember there were some problems resulting from the footnote fix I committed with ba59e54, and I think he fixed them. (I was pretty checked out with work at the time.) Nikolay, do you remember anything about this, and how we should best implement the fix for endnotes?

@jkr, details are fuzzy at best. https://github.com/jgm/pandoc/pull/2034 is the fix you're talking about. But that was reference.docx producing errors, not Pandoc's output.

It's been a while since I last worked with OOXML, so I'm not even sure how endnotes are implemented there...

Okay, I compared first-pass and second-pass outputs. Diff is below. It seems that the problem isn't endnotes, but duplication of nodes in numbering.xml.

--- first-pass/numbering.xml    2016-02-07 22:40:39.551465064 +0300
+++ second-pass/numbering.xml   2016-02-07 22:40:39.551465064 +0300
@@ -163,10 +163,94 @@
       </w:pPr>
     </w:lvl>
   </w:abstractNum>
+  <w:abstractNum w:abstractNumId="990">
+    <w:nsid w:val="4d351f3e" />
+    <w:multiLevelType w:val="multilevel" />
+    <w:lvl w:ilvl="0">
+      <w:numFmt w:val="bullet" />
+      <w:lvlText w:val="" />
+      <w:lvlJc w:val="left" />
+      <w:pPr>
+        <w:tabs>
+          <w:tab w:val="num" w:pos="0" />
+        </w:tabs>
+        <w:ind w:left="480" w:hanging="480" />
+      </w:pPr>
+    </w:lvl>
+    <w:lvl w:ilvl="1">
+      <w:numFmt w:val="bullet" />
+      <w:lvlText w:val="" />
+      <w:lvlJc w:val="left" />
+      <w:pPr>
+        <w:tabs>
+          <w:tab w:val="num" w:pos="720" />
+        </w:tabs>
+        <w:ind w:left="1200" w:hanging="480" />
+      </w:pPr>
+    </w:lvl>
+    <w:lvl w:ilvl="2">
+      <w:numFmt w:val="bullet" />
+      <w:lvlText w:val="" />
+      <w:lvlJc w:val="left" />
+      <w:pPr>
+        <w:tabs>
+          <w:tab w:val="num" w:pos="1440" />
+        </w:tabs>
+        <w:ind w:left="1920" w:hanging="480" />
+      </w:pPr>
+    </w:lvl>
+    <w:lvl w:ilvl="3">
+      <w:numFmt w:val="bullet" />
+      <w:lvlText w:val="" />
+      <w:lvlJc w:val="left" />
+      <w:pPr>
+        <w:tabs>
+          <w:tab w:val="num" w:pos="2160" />
+        </w:tabs>
+        <w:ind w:left="2640" w:hanging="480" />
+      </w:pPr>
+    </w:lvl>
+    <w:lvl w:ilvl="4">
+      <w:numFmt w:val="bullet" />
+      <w:lvlText w:val="" />
+      <w:lvlJc w:val="left" />
+      <w:pPr>
+        <w:tabs>
+          <w:tab w:val="num" w:pos="2880" />
+        </w:tabs>
+        <w:ind w:left="3360" w:hanging="480" />
+      </w:pPr>
+    </w:lvl>
+    <w:lvl w:ilvl="5">
+      <w:numFmt w:val="bullet" />
+      <w:lvlText w:val="" />
+      <w:lvlJc w:val="left" />
+      <w:pPr>
+        <w:tabs>
+          <w:tab w:val="num" w:pos="3600" />
+        </w:tabs>
+        <w:ind w:left="4080" w:hanging="480" />
+      </w:pPr>
+    </w:lvl>
+    <w:lvl w:ilvl="6">
+      <w:numFmt w:val="bullet" />
+      <w:lvlText w:val="" />
+      <w:lvlJc w:val="left" />
+      <w:pPr>
+        <w:tabs>
+          <w:tab w:val="num" w:pos="4320" />
+        </w:tabs>
+        <w:ind w:left="4800" w:hanging="480" />
+      </w:pPr>
+    </w:lvl>
+  </w:abstractNum>
   <w:num w:numId="1">
     <w:abstractNumId w:val="0" />
   </w:num>
   <w:num w:numId="1000">
     <w:abstractNumId w:val="990" />
+  </w:num>
+  <w:num w:numId="1000">
+    <w:abstractNumId w:val="990" />
   </w:num>
 </w:numbering>
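
The diff above shows `w:num` with `numId` 1000 appearing twice in the second-pass file. A check for repeated ids of this kind might look like the sketch below; `duplicate_ids` is a hypothetical helper, and it assumes numbering.xml uses the standard WordprocessingML main namespace.

```python
import collections
import xml.etree.ElementTree as ET

# Standard WordprocessingML main namespace (the "w:" prefix).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def duplicate_ids(numbering_xml):
    """Return ids used by more than one w:abstractNum or w:num element.

    `numbering_xml` is the content of word/numbering.xml as str or bytes.
    """
    root = ET.fromstring(numbering_xml)
    dups = {}
    for tag, attr in (("abstractNum", "abstractNumId"), ("num", "numId")):
        ids = (el.get("{%s}%s" % (W, attr))
               for el in root.iter("{%s}%s" % (W, tag)))
        counts = collections.Counter(i for i in ids if i is not None)
        dups[tag] = sorted(i for i, n in counts.items() if n > 1)
    return dups
```

The input can be read straight from a docx with `zipfile.ZipFile(path).read("word/numbering.xml")`.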

Hmm... so looking over that, it doesn't seem like it would hurt anything if I put in a similar endnote change to the writer. I'll test it out later and see if it works.

@jkr The SDK tool can be helpful in validating (you may be aware of it already).

Hmm. I think I misunderstood the original problem. OSX Word 15 adds in word/endnotes.xml it seems, and apparently removes word/_rels/footnotes.xml.rels? I have no idea why that happens, and I don't have access to Windows Word at the moment, never mind OSX. So can't be of much help there.

I am having the same problem: Word flags documents converted using my reference.docx as corrupt. This seems to have been going on for several years. If I create a clean reference.docx using pandoc, I can use it to create new Word documents that open cleanly. But if I edit and save the reference.docx using Word 2016, my converted documents are flagged as corrupt. I'm using pandoc 1.17.0.2 and Word 2016 for Mac 15.20. I have the same problem if I edit reference.docx using Word 2011 for Mac 14.6.2.

It would help if you said exactly what you changed in the
reference.docx (even better if you uploaded it so we could
test ourselves).

The problem doesn't depend on what change I make to the default reference.docx generated by pandoc. I can open it up, enter and delete a space in the body, then save it, and the saved version will be about half again as large as the original, and documents created by pandoc using it as a reference will generate a corruption error from Word. I've uploaded examples of both here.
reference.pandoc.docx
reference.msword.docx
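
To see where the extra size comes from, the two packages can be compared part by part. The following is a minimal sketch with hypothetical helper names (`part_sizes`, `compare_parts`); it reports only which parts were added, removed, or changed in uncompressed size, not what changed inside them.

```python
import zipfile


def part_sizes(docx):
    """Map each part name in the docx package to its uncompressed size."""
    with zipfile.ZipFile(docx) as z:
        return {info.filename: info.file_size for info in z.infolist()}


def compare_parts(original, resaved):
    """Summarise which parts differ between two docx packages."""
    a, b = part_sizes(original), part_sizes(resaved)
    return {
        "added": sorted(set(b) - set(a)),
        "removed": sorted(set(a) - set(b)),
        "resized": sorted(n for n in set(a) & set(b) if a[n] != b[n]),
    }
```

For the two uploads named above this would be called as `compare_parts("reference.pandoc.docx", "reference.msword.docx")`.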

As mentioned before, I think it is related to endnotes; I'm not sure whether @jkr looked at a fix.

Just confirmed that the same thing happens with Word 2016 for Windows. Interestingly, though, I get the following error message when trying to save the modified version of the reference.docx generated by pandoc:

image

I'm having a similar problem. I'm completely new to pandoc, so I installed the latest version (1.19.1) about a month ago. I also have Microsoft Word 2010. At first I tried some conversions from markdown to docx using default pandoc settings - everything worked perfectly. Today I decided to create my custom reference.docx. I followed all the steps in the pandoc manual. What I did:

  1. Created reference.docx using pandoc --print-default-data-file reference.docx > reference.docx.
  2. Opened it in Word and edited only styles, didn't touch any text at all. I only changed the most basic styles: title, body, footnote, headings 1-4.

When I convert my .md file using this changed reference.docx, I get the following error: "The file cannot be opened because there are problems with the content".

When I click "Details" I see: "Microsoft Word could not open this file because some parts are missing or invalid".

Then, Word offers to repair the content, I click "yes", and my file opens just fine, with all the formatting I set up in reference.docx. It also displays "Show Repairs" window that says "Endnote 1". If I click "go to", it says "This bookmark does not exist". Funnily enough, my document does not have any endnotes.

I tried creating a new reference.docx using pandoc as before. If I don't change it, it works perfectly - but, of course, it doesn't have the styles I need. As soon as I change any style at all, the same error returns.

Unlike some previous posters, my problem does not seem to be related to footnotes. I tried converting test document with or without footnotes, but it doesn't make any difference - still the same error. The error seems to be about endnotes, which I don't use!

After Word repairs the document, I can use it just fine, and all formatting looks OK, but this error is annoying, and I'm worried about longer and more complex documents.

@kimlika could you attach or link to the reference.docx you
created, so we can test further? Also, was it Word 2010 for
Mac or Windows?

Here's my reference.docx. It's Word 2010 on Windows 7 64-bit.

reference.docx

@kimlika I tried with your reference.docx and had no trouble creating a document that opened properly in Word (15.28 for Mac). Unfortunately I don't have Word 2010 for Windows to test with.

Tested with Word 2010 under Windows 7 Pro - didn't get any errors.
@kimlika Could you please upload the document that gave the error?

I still have this problem.

Here are the related files:

download.docx
reference.docx

I'm still having this problem too. I kind of gave up and learned to live with it ;)

@zhangtemplar @kimlika which version of pandoc are you using?

It looks like it was fixed in 2.2 (see #4621) - I just tested on a Mac with a fresh installation of pandoc (2.2.1) with:

  1. pandoc --print-default-data-file reference.docx > custom-reference.docx
  2. Open custom-reference.docx in Word 2016 and save under new name.
  3. pandoc --reference-doc=custom-reference-saved-from-word.docx test.md -o test.docx
  4. Open test.docx in Word 2016 - it opened fine.

When I compared settings.xml inside test.docx and download.docx, I noticed that download.docx had:

    <w:endnotePr>
        <w:endnote w:id="-1" />
        <w:endnote w:id="0" />
    </w:endnotePr>

which is not present in test.docx. So it looks like you might have used an older version.

@nkalvi

pandoc 1.12.4.2
Compiled with texmath 0.6.6.1, highlighting-kate 0.5.8.5.
Syntax highlighting is supported for the following languages:
    actionscript, ada, apache, asn1, asp, awk, bash, bibtex, boo, c, changelog,
    clojure, cmake, coffee, coldfusion, commonlisp, cpp, cs, css, curry, d,
    diff, djangotemplate, doxygen, doxygenlua, dtd, eiffel, email, erlang,
    fortran, fsharp, gcc, gnuassembler, go, haskell, haxe, html, ini, isocpp,
    java, javadoc, javascript, json, jsp, julia, latex, lex, literatecurry,
    literatehaskell, lua, makefile, mandoc, markdown, matlab, maxima, metafont,
    mips, modelines, modula2, modula3, monobasic, nasm, noweb, objectivec,
    objectivecpp, ocaml, octave, pascal, perl, php, pike, postscript, prolog,
    pure, python, r, relaxngcompact, restructuredtext, rhtml, roff, ruby, rust,
    scala, scheme, sci, sed, sgml, sql, sqlmysql, sqlpostgresql, tcl, texinfo,
    verilog, vhdl, xml, xorg, xslt, xul, yacc, yaml
Default user data directory: /root/.pandoc
Copyright (C) 2006-2014 John MacFarlane
Web:  http://johnmacfarlane.net/pandoc
This is free software; see the source for copying conditions.  There is no
warranty, not even for merchantability or fitness for a particular purpose.

@nkalvi OK, the container I am running is based on Jessie, which only has pandoc 1.12.4.2.

@zhangtemplar The reference doc generated by pandoc (even by the older version) didn't contain endnote settings; Word added styles for endnotes when you saved it after changing it. Newer versions of pandoc strip these out. Pandoc didn't add the relevant entries and files for this endnote style when generating docs based on the saved file, which caused the error.

In case you cannot update the version you have in your system, could you try this workaround:

  1. Install the current version of pandoc on another system
  2. Create another reference.docx based on the one you have, for example: pandoc --reference-doc=current-reference-saved-from-word.docx simple.md -o new-reference.docx
  3. Use that new-reference.docx in your current system (with old pandoc)

@nkalvi I managed to install the latest version of pandoc via the binary package.

@zhangtemplar Hope that solved the issue - please let us know if possible.

@nkalvi after installing version 2.2, the bug is gone.

@zhangtemplar Very good! - thanks for reporting.

Having this issue in Pandoc 2.7.3, Mac OS 10.14 and Office Version 16. I generated the reference doc and modified it to add header and footer. Everything works fine until I add an image and the doc shows as corrupted. Works after recovering though.

Using pandoc for HTML to docx conversion.

If you can create a minimum example showing the problem, I'd open a new issue, since it would be a different one from this one anyway.

When I set the options like below:

fig.width = 4, fig.height = 4.

The docx output ignores the options and automatically sets the figure size to around 5.xx-ish.
I tried setting the options in the YAML part instead, but they were still ignored.
