I think this is pretty self-explanatory, but the ability to set custom properties in a Word document would be a total game-changer, especially for those of us working for companies so entrenched in the MS garden that we use SharePoint as a DMS. I have hundreds of md files that I'd love to convert to Word and have them retain the custom key in the key-value pair. Then it's just a drag and drop into a SharePoint library, which should pick up those custom properties and map them correctly for list views, etc.
I appreciate this is a very big ask. Cheers.
@jkr
https://github.com/jgm/pandoc/issues/3034
@rdwatters Would you be able to upload a docx file with some of these properties, and a pointer to which particular properties you're interested in? What you're requesting might well be possible, but I'd have to see where in the maze of xml those properties reside.
Absolutely @jkr. I'll put together some screenshots at work later today.
@jkr @jgm
Here is a link to a docx created in MS Word for Mac Version 15.24; my hope is that these custom properties do not change depending on whether it's Windows/OSX:
https://www.dropbox.com/s/86qs1hfng6l601r/pandoc-sample.docx?dl=0
Not sure if it helps, but...




I guess I would have to dig deeper to see how these different key-values would map, but things like "keywords" in the .docx property (usually "tags" in a .md .yml front matter) seem to be equivalent.
I know I'm throwing out a _ton_ in this single comment, so let me know how I can help/clarify.
So, just so I understand this -- would you like to have the top-level metadata map into the properties?
title: My Title
keywords: publishing, microsoft, etc.
...
Blah blah blah.
or do you want to have the ability to have a _different_ title, like you do in the supplied document?
title: My Title
docx-props:
  title: This wasn't added...
  keywords: publishing, microsoft, etc..
...
Blah blah blah
I don't know what Sharepoint is, or how 365 works -- are they important to this, or are we just trying to get the metadata into props?
Or is it actually the info under "custom" (pandoc1="wicked", pandoc2="awesom") that you're most concerned with?
There are a lot of properties here, so I want to make sure I'm looking at the right thing.
@jkr Both excellent questions that demonstrate how terrible my examples were. The following is long-winded but thorough. I _really, really_ appreciate you taking the time to look into this.
SharePoint is significant in that it's the #1 most popular intranet tool (>50% of Fortune 500 companies) and is MS's native document management system. It is also notoriously difficult to extract content from. Companies invest enormous amounts of time tagging and cataloguing Word documents with metadata that is embedded directly into the document but then is _only easily digestible by a further SharePoint instance_. Ideally, Pandoc would be able to do these conversions bidirectionally between Word and MD - again, I appreciate that this is a _huge_ ask.
Here is a temp repo that houses the word versions used in the following screenshots:
https://github.com/rdwatters/pandoc-word-samples
1. (see word-only.docx). So here are the properties of the doc created in Word. These properties (Title, Subject, Author, Manager, Company, Category, Keywords, Comments) are all out-of-the-box:

Here is a shot of the added Custom Property (for this example, the property is "CustomProperty", data type is text, and value is "Hello Pandoc"):

As an aside, I'm using two separate titles to demonstrate how Pandoc pulls from the title in page copy during a .docx => .md conversion, whereas MS (for the purpose of its document management system/SharePoint) pulls the title from the properties pane of the .docx. A minor improvement for Pandoc in its .docx => .md conversion would be to first check for a "Title" property in the properties pane to add to the yaml of a markdown file and then, if the properties pane does not include the title, pull the text styled as "Title" in the Word document. Per your question re: adding two separate titles, I don't think adding two separate titles adds much value. Right now, when pandoc converts .md => .docx, it takes title: from the markdown file's yaml and adds it to the body copy (with appropriate styling) and also to the title field in properties, which is AWESOME.
2. So now that we have a Word document (created locally) with the out-of-box properties and one custom property added, the doc can be added to a SharePoint list, which, to an admin, looks like the following screenshot. Note how "Comments" has been converted to "Description" and the "Creator" column (I'm using DCMI metadata for this example, which is an out-of-the-box content type in SharePoint) is auto-populated from the Author.

3. You can see in the last screenshot that "Publisher" doesn't have a value. These values can be updated within the SharePoint UI and are embedded in the Word document itself:

4. And here is the resulting update in the SharePoint list (keep in mind these lists will often have tens of thousands of Word documents):

5. (see word-with-properties-added-in-sp.docx) Now that these properties are in the document, I'll download a copy of the doc and retitle it to word-with-properties-added-in-sp.docx. When I open this local copy and go to properties, everything remains, BUT both "Publisher" and "Contributors" cannot be accessed from within Word. They are in the document and will automatically update in a SharePoint list, but they cannot be reached from the property panes in the Word app:

6. (COPY-word-with-properties-added-in-sp.docx) And finally, just to double check that both the custom property (ie, CustomProperty) and the properties we added via SharePoint directly (ie, the DCMI "Contributors" and "Publisher") are actually embedded in the document, I'll upload the locally renamed version of the original document as well as a copy to the SharePoint list, which shows the properties (ie, metadata) actually travel with the document:

So what would be the _ideal_ workflow?
Having the document I created above in Word write back to a markdown file with all the out-of-the-box properties (listed above), the custom properties (in this case, just "CustomProperty"), and the other content type properties (in this case, "Publisher" and "Contributors") as typical key-value pairs....and vice versa.
All this insanity because the last 5 years of markdown evangelism have led to exactly 0 converts in my last three companies.
Hopefully that makes sense. Thanks again!
> Having the document I created above in Word write back to a markdown file with all the out-of-the-box properties (listed above), the custom properties (in this case, just "CustomProperty"), and the other content type properties (in this case, "Publisher" and "Contributors") as typical key-value pairs....and vice versa.
So now I'm confused. It seems like you're asking for the ability to go docx -> markdown, while the title suggests you're interested in going markdown -> docx. While I appreciate the completeness of your description, I'm still unsure of what you want pandoc to do. A simple input file with expected output file would definitely help.
Now, assuming you want something bidirectional, let me mention some concerns.
That being said, in this direction, it would be quite simple to write a python script that would produce full yaml from a docx file, and insert it into the resulting markdown file. (It would just require unzipping with zipfile, and parsing xml with etree or the like). I could write a version of that for you when I get a chance.
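For anyone who wants to try the docx -> yaml direction in the meantime, here is a rough sketch of such a script. It is not part of pandoc; the function names `read_docx_properties` and `to_front_matter` are made up for illustration, and it assumes the properties live in docProps/core.xml and docProps/custom.xml, as in the samples in this thread. Every value is emitted as a quoted string, and SharePoint content-type properties stored elsewhere in the package are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of docProps/custom.xml, as seen in the samples in this thread.
CUSTOM_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/custom-properties}"


def read_docx_properties(docx):
    """Return {name: text} for core and custom properties of a .docx (path or file object)."""
    props = {}
    with zipfile.ZipFile(docx) as zf:
        names = set(zf.namelist())
        if "docProps/core.xml" in names:
            for el in ET.fromstring(zf.read("docProps/core.xml")):
                # Tags look like "{namespace}title"; keep only the local name.
                if el.text:
                    props[el.tag.split("}")[-1]] = el.text
        if "docProps/custom.xml" in names:
            root = ET.fromstring(zf.read("docProps/custom.xml"))
            for prop in root.iter(CUSTOM_NS + "property"):
                # Each <property> wraps one typed child such as <vt:lpwstr>.
                value = next(iter(prop), None)
                if value is not None:
                    props[prop.get("name")] = value.text or ""
    return props


def _quote(text):
    """Double-quote a string for YAML, escaping backslashes, quotes and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def to_front_matter(props):
    """Render the properties as a YAML metadata block."""
    lines = ["---"]
    lines += [f"{_quote(key)}: {_quote(value)}" for key, value in props.items()]
    lines.append("---")
    return "\n".join(lines) + "\n"
```

The output of `to_front_matter(read_docx_properties("some.docx"))` could then be prepended to the markdown that pandoc produces from the same file.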
Agreed w/r/t title of this feature. I have updated it accordingly. Now in terms of your responses:
I've stumbled across this issue today after struggling with docx Custom Properties. It would be amazing if pandoc could handle them. Much like @rdwatters I work at a company that has all of its documentation in docx format with plenty of custom properties.
If it helps clarify things, here's my immediate use case:
I currently use markdown with YAML at the top of the document like so:
title: MyTitle
author: Sealatron
...
# Heading
## Sub Heading
I convert these to docx using a command like this:
pandoc -s --reference-docx reference.docx -f markdown+yaml_metadata_block -t docx input.md -o output.docx
My issue is that there are custom properties in reference.docx that appear in the headers and footers. Just now they transfer across to my output.docx with their original values (thus giving me erroneous values w.r.t. the doc I'm writing), but what I'd ideally like to be able to do is override them somehow in my original input.md like so:
title: MyTitle
author: Sealatron
custom1: MyCustom1Value
custom2: MyCustom2Value
...
This is of course just an example, I don't know what the correct syntax would be. I'd expect the output to look like:
- If the custom properties don't exist in reference.docx, add them to output.docx.
- If they already exist in reference.docx, override the values with the new ones in output.docx.
Obviously, like @rdwatters said, this is a big ask (just glancing at the raw xml of the reference.docx told me how weird the format was) but I'd love to see something like this in pandoc in future!
I guess what would really help is the exact XML output needed for different cases...
Agreed, sorry I left that out of my example.
I first created reference.docx by converting an example markdown document to docx. Within Word, I created 'Custom Properties' as shown in @rdwatters's comments above. Here's what that looks like within reference.docx::docProps\custom.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="Custom1">
<vt:lpwstr>MyCustom1Value</vt:lpwstr>
</property>
<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="3" name="Custom2">
<vt:lpwstr>MyCustom2Value</vt:lpwstr>
</property>
<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="4" name="Custom3">
<vt:lpwstr>MyCustom3Value</vt:lpwstr>
</property>
</Properties>
These DocProperty values are inserted using Word as 'Fields' in the headers/footers. The xml for those looks like this:
<w:r>
<w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
<w:instrText xml:space="preserve"> DOCPROPERTY Custom1 \* MERGEFORMAT </w:instrText>
</w:r>
<w:r>
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r>
<w:t>MyCustom1Value</w:t>
</w:r>
<w:r>
<w:fldChar w:fldCharType="end"/>
</w:r>
After running the pandoc command in my previous comment, these fields remain in the headers/footers of the output.docx, but the corresponding custom.xml doesn't exist.
What I'd like to be able to do is specify custom properties like the above somehow in my original markdown document so that the fields in my output.docx headers/footers will update correctly. This might be as straightforward as creating a custom.xml from the provided YAML?
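To make the "create a custom.xml from the provided YAML" idea concrete, here is a minimal sketch. It assumes the metadata has already been parsed into a plain dict of string values; `build_custom_xml` is a hypothetical helper, not a pandoc function. The namespaces, the fmtid and the pid numbering starting at 2 are copied from the custom.xml shown above.

```python
from xml.sax.saxutils import escape, quoteattr

CUSTOM_NS = "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
VT_NS = "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"
# fmtid copied from the custom.xml samples in this thread.
FMTID = "{D5CDD505-2E9C-101B-9397-08002B2CF9AE}"


def build_custom_xml(props):
    """Serialize {name: value} as the text of docProps/custom.xml (all values as strings)."""
    parts = [
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>',
        f'<Properties xmlns="{CUSTOM_NS}" xmlns:vt="{VT_NS}">',
    ]
    # pid starts at 2 in every sample above, so the same numbering is used here.
    for pid, (name, value) in enumerate(props.items(), start=2):
        parts.append(
            f'<property fmtid="{FMTID}" pid="{pid}" name={quoteattr(name)}>'
            f"<vt:lpwstr>{escape(str(value))}</vt:lpwstr></property>"
        )
    parts.append("</Properties>")
    return "\n".join(parts)
```

For example, `build_custom_xml({"Custom1": "MyCustom1Value"})` yields a document with a single property named Custom1 and pid 2. Note that this only produces the custom.xml part; the cached text inside the header/footer fields would presumably still show the old value until Word updates the fields.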
I too would like this feature to be added to the pandoc docx writer. However, I don't think the issue is 'bite-sized' at this moment. May I suggest a first step towards the full implementation suggested in the above posts?
Step 1 could be implementing all default metadata in the docx writer, similar to the way the EPUB writer handles it. Microsoft uses Dublin Core in core.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dcmitype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><dc:title>rkn-titel</dc:title><dc:subject>rkn-onderwerp</dc:subject><dc:creator>rkn-auteur</dc:creator><cp:keywords>rkn-trefwoord1 rkntrefwoord2</cp:keywords><dc:description>rkn-opmerking</dc:description><cp:lastModifiedBy>René Knuvers</cp:lastModifiedBy><cp:revision>1</cp:revision><dcterms:created xsi:type="dcterms:W3CDTF">2018-01-11T18:20:00Z</dcterms:created><dcterms:modified xsi:type="dcterms:W3CDTF">2018-01-11T18:24:00Z</dcterms:modified><cp:category>rkn-categorie</cp:category></cp:coreProperties>
so it would be nice to populate those values from within the YAML-metadatablock in a markdown file, or a separate xml as could be used for EPUB.
---
Title: rkn-title
Subject: rkn-onderwerp
Author: rkn-auteur # note that this will populate the "dc:creator" field
Keywords:
- rkntrefwoord1
- rkntrefwoord2
Revision: 1
Description: rkn-opmerking
---
Supporting all other fields from DCMI (http://dublincore.org/documents/dc-xml-guidelines/) would be nice. This would also include fields for a unique document identifier and a status field.
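As a rough illustration of the mapping proposed in step 1, the sketch below turns a parsed metadata dict into core.xml child elements. The `CORE_ELEMENTS` table and the `core_fragments` name are hypothetical, the element names are taken from the core.xml sample above, and joining list values with ", " is an assumption (the sample stores keywords as one plain string). Revision is left out here and would need separate care.

```python
from xml.sax.saxutils import escape

# Metadata key (lower-cased) -> core.xml element, following the sample above.
CORE_ELEMENTS = {
    "title": "dc:title",
    "subject": "dc:subject",
    "author": "dc:creator",
    "keywords": "cp:keywords",
    "description": "dc:description",
}


def core_fragments(meta, list_sep=", "):
    """Return core.xml child elements for the metadata keys that have a known mapping."""
    fragments = []
    for key, value in meta.items():
        tag = CORE_ELEMENTS.get(key.lower())
        if tag is None:
            continue  # unmapped keys are left for custom.xml or ignored
        if isinstance(value, list):
            # Assumption: list values such as keywords are flattened to one string.
            value = list_sep.join(str(item) for item in value)
        fragments.append(f"<{tag}>{escape(str(value))}</{tag}>")
    return fragments
```

The returned strings would still have to be wrapped in a cp:coreProperties root that declares the cp and dc namespaces.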
Step 2 could be implementation of populating 'custom.xml':
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"><property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="Afdeling"><vt:lpwstr>rkn-afdeling</vt:lpwstr></property><property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="3" name="Documentnummer"><vt:lpwstr>rkn-documentnummer</vt:lpwstr></property><property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="4" name="Gecontroleerd door"><vt:lpwstr>rkn-gecontroleerd door</vt:lpwstr></property><property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="5" name="Project"><vt:lpwstr>rkn-project</vt:lpwstr></property><property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="6" name="Status"><vt:lpwstr>rkn-status</vt:lpwstr></property><property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="7" name="Taal"><vt:lpwstr>rkn-taal</vt:lpwstr></property></Properties>
... but that may be too hard for a simple implementation. Some of the custom fields are actually prepopulated but seem language-dependent. I'm Dutch and use a multilingual Word for Mac 16.8 (171210) set to Dutch (nl-NL).
This would be a very nice feature.
At least in the md -> docx direction, as it would add the template-variable-substitution that we have on the template based writers (all save docx, odt, ppt...).
We would be removing the need to edit headers and footers in word after conversion to enter the correct "title" or "department" etc.
Under https://github.com/jgm/pandoc/issues/2839 a commit has been included to support odt custom properties (on the writer). It would be great to have something similar for docx.
Looks like we'd need, in [Content_Types].xml:
<Override PartName="/docProps/custom.xml" ContentType="application/vnd.openxmlformats-officedocument.custom-properties+xml"/></Types>
In _rels/.rels:
<Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/custom-properties" Target="docProps/custom.xml"/>
and then we'd need docProps/custom.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="Foo bar">
<vt:lpwstr>hello there</vt:lpwstr>
</property>
<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="3" name="Zoopie">
<vt:lpwstr>1123</vt:lpwstr>
</property>
</Properties>
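Until the writer does this itself, the three pieces listed above can be added to an existing .docx in a post-processing step. The following is only a sketch under stated assumptions: `add_custom_part` is a hypothetical helper, the package relationships part is assumed to be `_rels/.rels`, both XML parts are patched by plain string replacement rather than real XML parsing, and the relationship Id is simply the first unused rIdN.

```python
import re
import zipfile

OVERRIDE = (
    '<Override PartName="/docProps/custom.xml" '
    'ContentType="application/vnd.openxmlformats-officedocument.custom-properties+xml"/>'
)
REL_TYPE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/custom-properties"


def add_custom_part(src, dst, custom_xml):
    """Copy the .docx at src to dst, adding docProps/custom.xml and registering it."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            if item.filename == "docProps/custom.xml":
                continue  # an existing part is replaced by the one written below
            data = zin.read(item.filename)
            if item.filename == "[Content_Types].xml":
                text = data.decode("utf-8")
                if "/docProps/custom.xml" not in text:
                    text = text.replace("</Types>", OVERRIDE + "</Types>")
                data = text.encode("utf-8")
            elif item.filename == "_rels/.rels":
                text = data.decode("utf-8")
                if REL_TYPE not in text:
                    # Pick the first relationship Id that is not already in use.
                    used = set(re.findall(r'Id="([^"]*)"', text))
                    n = 1
                    while f"rId{n}" in used:
                        n += 1
                    rel = f'<Relationship Id="rId{n}" Type="{REL_TYPE}" Target="docProps/custom.xml"/>'
                    text = text.replace("</Relationships>", rel + "</Relationships>")
                data = text.encode("utf-8")
            zout.writestr(item.filename, data)
        zout.writestr("docProps/custom.xml", custom_xml)
```

A call such as `add_custom_part("output.docx", "output-with-props.docx", xml_text)` writes a new file and leaves the original untouched, since zipfile cannot modify an archive member in place.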
A general question affecting this and #2839: how should we identify custom properties?
In #2839 I just took everything except title, author, date, and lang as a custom property.
But perhaps instead we should look for a special section custom:
custom:
  prop1: foo
  prop2: bar
This would avoid getting properties like toc: true and so on.
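The two options can be compared on a parsed metadata dict. This is a sketch only, not pandoc's implementation; the function names and the `STANDARD_KEYS` set are hypothetical, with the excluded keys taken from the description of #2839 above.

```python
# Keys treated as standard in the #2839 approach described above.
STANDARD_KEYS = {"title", "author", "date", "lang"}


def custom_by_exclusion(meta):
    """Option 1: everything that is not a standard key becomes a custom property."""
    return {key: value for key, value in meta.items() if key not in STANDARD_KEYS}


def custom_by_section(meta):
    """Option 2: only the entries under a dedicated 'custom' key become custom properties."""
    return dict(meta.get("custom", {}))
```

With metadata containing title, toc: true and a custom section, option 1 also exports toc (and the custom mapping itself as one value), while option 2 exports only prop1 and prop2.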
Well, what makes these metadata keys more custom than others? If anything, they seem to belong to the top level, together with title, author, etc. And toc: true would seem to belong to a sub-level, like template: (of course, backwards-compatibility...)
Maybe a question to anybody in this thread: would it hurt to put all properties in the docx? Even toc: true, header-includes, etc? I'm guessing it wouldn't hurt... maybe people using SharePoint in their company have a naming-scheme already with a custom prefix?
Good question.
Personally I like the simplicity of having everything that is not an "official" property of the target format as a custom property, like what was done for #2839
I guess having a special section would work too, although some property duplication in the front matter could occur in this case
Whatever is decided in the end should be the same for ODT and DOCX
Thank you for tackling this!!
> Maybe a question to anybody in this thread: would it hurt to put all properties in the docx? Even toc: true, header-includes, etc? I'm guessing it wouldn't hurt...
Probably not. As long as the custom property isn't mapping to any sort of library/site column in SharePoint, it shouldn't matter. (At some point, it has to be the Pandoc _user's_ responsibility to match these.)
> Personally I like the simplicity of having everything that is not an "official" property of the target format as a custom property, like what was done for #2839
If I understand this correctly, yes, I agree. The standard properties, I believe, are everything in the "summary," as shown in the screenshot above.
One might, after some digging, start to suspect that Microsoft made this unnecessarily complex intentionally :wink:
@jgm How could I help move this one forward?
I see a couple "core" (core.xml) properties probably missing:
The extended properties (app.xml) seem to include:
I guess any additional property should go to the custom.xml
Regarding the use of a special section for custom properties, I think it wouldn't be necessary, but I'm not against it either. We could add all not-ignored-properties (not ending in _) to the custom bag, or just the contents of the custom section in the yaml. What would make more sense taking into account the rest of the writers?
BTW docx custom properties can be strings, but also numbers, dates and booleans:
<property name="AMBCustomKey" pid="4" fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}">
<vt:lpwstr>AMB Custom Value</vt:lpwstr>
</property>
<property name="Número de documento" pid="5" fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}">
<vt:i4>42</vt:i4>
</property>
<property name="Fecha de registro" pid="6" fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}">
<vt:filetime>2019-01-30T23:00:00Z</vt:filetime>
</property>
<property name="AMBBinary" pid="7" fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}">
<vt:bool>true</vt:bool>
</property>
In case we want to take that into account.
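If the types are taken into account, the choice of element could follow the type of the parsed metadata value. A minimal sketch, assuming values arrive as Python str, int, bool or datetime objects; `vt_element` is a hypothetical name, and only the four element types shown above are covered.

```python
from datetime import datetime, timezone


def vt_element(value):
    """Return (element name, text) for a custom property value."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "vt:bool", "true" if value else "false"
    # vt:i4 is a 32-bit signed integer; larger numbers fall through to a string.
    if isinstance(value, int) and -2**31 <= value < 2**31:
        return "vt:i4", str(value)
    if isinstance(value, datetime):
        # Naive datetimes are interpreted as local time by astimezone().
        utc = value.astimezone(timezone.utc)
        return "vt:filetime", utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    return "vt:lpwstr", str(value)
```

For instance, `vt_element(42)` gives `("vt:i4", "42")`, matching the "Número de documento" property above.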
I added some basic support for this in a PR, trying to replicate what was done for ODT.
Even though we might end up changing which properties to write, both ODT and DOCX writers should be consistent.
Speaking about the docx writer:
I think I'll ignore extended properties for now (app.xml), but having support for core.xml already in pandoc, I think it would be useful to take advantage of it instead of putting all as custom properties.
Docx core properties title, creator, keywords, created, modified are already supported by pandoc (creator is author in pandoc properties, and the last two are calculated automatically).
However there are additional properties available which we could equate with some pandoc ones:
| docx core property | pandoc property | notes |
|--------------------|-----------------|--------------------------|
| ~subject~ | ~subtitle~ | similar to other writers? No, see remarks in the PR |
| ~description~ | ~abstract~ | |
| language | lang | does Word use it? |
There are additional docx core properties that we could parse from pandoc directly without changing their names:
This is a non-exhaustive list, but I think I got the main entries.
I'll try to modify the PR accordingly.
UPDATE:
Identifier and version in core.xml seem to get lost or are not accessible from within Word. Revision is also giving me grief in core.xml; I can completely break Word with it. I will remove these three from core (note that they can still be used as custom properties).
I'm not sure if language is doing anything in core, I do not see a difference one way or another. Should I also move it to custom?
Subject is quite different from subtitle and should not be made equivalent. Thanks @HeirOfNorton
I've improved the PR https://github.com/jgm/pandoc/pull/5252 in order to align writing document properties and custom properties in docx, odt and pptx
@agusmba can this issue now be closed? If not, what remains to be done?
Well, I only implemented the writer part, and the title of this issue includes "and vice versa", so I guess it also requests the reader part. We could open a different issue for the reader part and close this one, or keep this one open.
> Well, I only implemented the writer part, and the title of this issue includes and vice versa, so I guess it also requests the reader part. We could open a different issue for the reader part and close this one, or keep this one open.
That's fine, we can keep this open -- I just wanted to summarize what was still needed: reading custom properties in docx into pandoc metadata.