Pandoc: Docx to markdown - images are not extracted with --extract-media for certain types of docx

Created on 10 Jul 2019 · 7 Comments · Source: jgm/pandoc

Issue
I've encountered a type of docx file (exported from Quip) whose images are not extracted when using --extract-media and -t markdown.
However, if the docx is opened in Word and then saved again, the images are extracted correctly. This suggests a file formatting issue, but the document renders fine in Word. I compared the document.xml in the two files and couldn't spot any distinct difference in their structure.

Test Files
I have attached two files:
Test.docx - the original exported file, containing 2 embedded images
Test2.docx - the original exported file, loaded into and then saved out from Word

Reproduction
pandoc "Test.docx" --verbose --extract-media=test_media --atx-headers -f docx -t markdown -o "Test.md"
Result: No images are exported
Expected: two images to be exported to test_media folder

pandoc "Test2.docx" --verbose --extract-media=test_media2 --atx-headers -f docx -t markdown -o "Test2.md"
Result: 2 images are exported to test_media2 folder, as expected.

[INFO] Extracting test_media2\media\image1.png...
[INFO] Extracting test_media2\media\image2.png...

Environment
Running Pandoc version 2.7.3 on Windows 10, 64-bit.

Attachments
Test.docx
Test2.docx

Thanks for a great tool.

Docx writer

All 7 comments

The relevant part from Test.docx:

    <wp:docPr id="10" name="media/JIcACABwiXP.png"/>
    <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
      <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
        <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
          <pic:nvPicPr>
            <pic:cNvPr id="0" name="media/JIcACABwiXP.png"/>
            <pic:cNvPicPr/>
          </pic:nvPicPr>
          <pic:blipFill>
            <a:blip r:embed="rId10"/>
            <a:stretch>
              <a:fillRect/>
            </a:stretch>
          </pic:blipFill>
          <pic:spPr>
            <a:xfrm>
              <a:off x="0" y="0"/>
              <a:ext cx="5352176" cy="4219662"/>
            </a:xfrm>
            <a:prstGeom prst="rect"/>
          </pic:spPr>
        </pic:pic>
      </a:graphicData>
    </a:graphic>

Probably related (or the same issues): https://github.com/jgm/pandoc/issues/1810 and https://github.com/jgm/pandoc/issues/5394

Do you know by what tool or word version the docx was generated?

The tool that generated Test.docx was the Salesforce Quip app. Entirely possible their exported docx markup is somehow at fault here, but it did seem like a valid docx so thought I'd report it as an issue here.

Test2.docx was generated by Word for Office 365, V16.0 32-bit, simply by opening Test.docx and then "saving as" Test2.docx - no other modifications to the doc.

The xml for Test2.docx is:

<wp:docPr id="9" name="media/JIcACA7YtNb.png"/>
<wp:cNvGraphicFramePr/>
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
    <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
        <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
            <pic:nvPicPr>
                <pic:cNvPr id="0" name="media/JIcACA7YtNb.png"/>
                <pic:cNvPicPr/>
            </pic:nvPicPr>
            <pic:blipFill>
                <a:blip r:embed="rId5"/>
                <a:stretch>
                    <a:fillRect/>
                </a:stretch>
            </pic:blipFill>
            <pic:spPr>
                <a:xfrm>
                    <a:off x="0" y="0"/>
                    <a:ext cx="5352176" cy="2961313"/>
                </a:xfrm>
                <a:prstGeom prst="rect">
                    <a:avLst/>
                </a:prstGeom>
            </pic:spPr>
        </pic:pic>
    </a:graphicData>
</a:graphic>
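
Since the two drawing fragments differ mainly in the relationship ID on `a:blip`, it may help to list those IDs mechanically rather than by eye. The following is a minimal sketch, not pandoc's own logic: the function name `blip_embed_ids` is hypothetical, and it assumes the main document part lives at `word/document.xml`.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces as they appear in the fragments quoted in this thread.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def blip_embed_ids(docx_path, part="word/document.xml"):
    """Return the r:embed IDs of every a:blip element, in document order.

    Hypothetical helper for illustration; `part` is assumed to be the
    main document part of the docx archive.
    """
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read(part))
    ids = []
    for blip in root.iter(f"{{{A_NS}}}blip"):
        embed = blip.get(f"{{{R_NS}}}embed")
        if embed is not None:  # skip blips that only link (r:link) or are empty
            ids.append(embed)
    return ids
```

Running `blip_embed_ids("Test.docx")` and `blip_embed_ids("Test2.docx")` on local copies of the attachments gives the IDs to look up in each file's document.xml.rels.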

Ah yes, they do look similar. The key is probably in the document.xml.rels file, which also contains:

<Relationship Id="rId10" Target="media/JIcACABwiXP.png" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"/>

Btw, the images don't show up in LibreOffice or Apple Pages either...
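
To check whether each image relationship actually points at a file inside the archive, something like the sketch below could be used. It is an illustration only, not how pandoc resolves images: the function name `image_relationships` is hypothetical, and it assumes the relationships part is at `word/_rels/document.xml.rels`, with relative targets resolved against the `word/` folder.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

RELS_PART = "word/_rels/document.xml.rels"  # assumed location of the rels part
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
IMAGE_TYPE = ("http://schemas.openxmlformats.org/officeDocument/"
              "2006/relationships/image")


def image_relationships(docx_path):
    """Return (rel_id, target, part_name, present_in_zip) per image relationship."""
    with zipfile.ZipFile(docx_path) as zf:
        names = set(zf.namelist())
        root = ET.fromstring(zf.read(RELS_PART))
    rows = []
    for rel in root.iter(REL_NS + "Relationship"):
        if rel.get("Type") != IMAGE_TYPE:
            continue
        if rel.get("TargetMode") == "External":
            continue  # linked, not embedded; nothing to look up in the zip
        target = rel.get("Target")
        if target.startswith("/"):
            part = target.lstrip("/")  # absolute within the package
        else:
            # relative targets are resolved against the word/ folder
            part = posixpath.normpath(posixpath.join("word", target))
        rows.append((rel.get("Id"), target, part, part in names))
    return rows
```

Any row whose last field is `False` names an image that the relationships file references but the zip does not contain under that exact name.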

Useful to know they don't render properly in other apps, since that indicates some kind of malformed document, in which case it's definitely an issue for Quip.

Posting up the xml for reference...
Test.docx document.xml.rels

<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml"/>
  <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings" Target="webSettings.xml"/>
  <Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.jpg"/>
  <Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/>
  <Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/>
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
  <Relationship Id="rId2" Type="http://schemas.microsoft.com/office/2007/relationships/stylesWithEffects" Target="stylesWithEffects.xml"/>
  <Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/numbering" Target="numbering.xml"/>
  <Relationship Id="rId9" Target="media/JIcACA7YtNb.png" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"/>
<Relationship Id="rId10" Target="media/JIcACABwiXP.png" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"/>

</Relationships>

Test2.docx document.xml.rels

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
    <Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/>
    <Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml"/>
    <Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/>
    <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
    <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/numbering" Target="numbering.xml"/>
    <Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image2.png"/>
    <Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
    <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings" Target="webSettings.xml"/>
</Relationships>

I will try to see if I can isolate which difference in the files is the cause of this.

To help with comparing docx files, I wrote a little shell script, https://github.com/jgm/diff-docx
This saves the trouble of unzipping and tidying.
(I've now put this in the tools/ directory of this repository instead of its own repository.)
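
For anyone without a shell handy, the same idea (open both archives, tidy each XML part, diff the result) can be approximated in Python. This is a rough sketch under my own assumptions, not a port of the script mentioned above; the names `tidy_parts` and `diff_docx` are hypothetical.

```python
import difflib
import zipfile
import xml.dom.minidom


def tidy_parts(docx_path):
    """Map each XML part in the archive to its pretty-printed lines."""
    parts = {}
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            if name.endswith((".xml", ".rels")):
                dom = xml.dom.minidom.parseString(zf.read(name))
                pretty = dom.toprettyxml(indent="  ")
                # drop whitespace-only lines left over from existing indentation
                parts[name] = [ln for ln in pretty.splitlines() if ln.strip()]
    return parts


def diff_docx(path_a, path_b):
    """Return unified-diff lines covering every XML part of both files."""
    a, b = tidy_parts(path_a), tidy_parts(path_b)
    out = []
    for name in sorted(set(a) | set(b)):
        out.extend(difflib.unified_diff(
            a.get(name, []), b.get(name, []),
            fromfile=f"{path_a}:{name}", tofile=f"{path_b}:{name}",
            lineterm=""))
    return out
```

Printing `"\n".join(diff_docx("Test.docx", "Test2.docx"))` shows per-part differences; parts that exist in only one file appear as wholly added or removed.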

Ah, it looks like the repository https://github.com/jgm/diff-docx has been removed in favor of pandoc/tools/diff-zip.sh (see 83a0104).
