Pandoc: Extracting more document structure from docx files

Created on 28 Dec 2014 · 44 Comments · Source: jgm/pandoc

Many thanks for the introduction of the docx reader in 1.13! This improves some of my workflows significantly. Unfortunately, I have a few use cases that do not work well under the current docx reader.

Typically, these cases are the result of extensive use of templates where all of the styles are based on the "Normal" style, as opposed to being based on, say, "Heading 1" or something more semantically appropriate. As a result, Pandoc is able to extract only minimal semantic information about the text. This is completely understandable, but nevertheless frustrating, because the user must manually re-apply the desired markup to the resulting Markdown.

An initial proposal might be to include more of this information in the output, i.e. essentially tagging all the output content with the name of the style(s) applied to it. So for example, a paragraph styled as "AChapterTitle" might result in the following Markdown:

<div class="AChapterTitle">

This is a chapter title

</div>

The downside of doing this is of course that it makes the resulting Markdown that much messier (though arguably this only affects files with extensive use of user-defined non-semantic styles). But the upshot is that it makes it possible for the user to write a filter---which knows the mapping from template styles to Markdown constructs---to clean up the file and restore the markup to a more semantic state. This approach adds some implementation complexity to the docx reader, but otherwise doesn't complicate Pandoc itself much if at all.

A more sophisticated alternative might be to pull the filter into Pandoc itself; i.e. to provide an API with special support for Word documents in the docx reader, and then provide a special --input-filter flag in Pandoc which would connect via this new API to the filter, which would then be able to look at the structure of the Word document and tell the docx reader directly what styles correspond to which semantic properties. This obviously adds additional implementation complexity to both the docx reader and to Pandoc, but has the advantage of avoiding information loss in the initial conversion to Markdown.

Or perhaps there might be alternative ways to extract this information. I would be happy with any approach that avoids throwing away the information entirely.

enhancement

Sounds like a good idea (maybe with a flag). @jkr can comment on the complexity.

You could try saving the .docx files as html (which pandoc reads more completely) and then write filters on that.

Re @mszep: That is exactly what I used to do, before Pandoc 1.13. Aside from the fact that the process is not automated (ok, maybe possible through COM, but that's a whole world of pain on its own), the HTML produced by Word is singularly awful. Among other things, Word mixes the use of style="..." and CSS classes, applies styles to non-semantic markup elements (e.g. using p instead of ul and applying CSS to cover over the differences), etc. Most importantly, the templates in question might or might not appear (in an obvious way) in the resulting HTML. So (a) you would need to filter before passing the HTML into Pandoc to have any hope of getting anything useful, and (b) writing said filter would be a complete pain.

Thus I'm hoping that the document structure will be easier to discover on the docx side, before any conversions are made (to HTML or Markdown or anything else). Whether or not that is actually the case is another question...

My original code for the docx reader, back in the docx2pandoc days, was actually very noisy, and kept a lot of information in divs and spans for future filtering, but that was removed at John's request. You could look back at the mail archives for initial discussion about that.

Note that some of this is solved for character styles, for which the reader now reads inheritance. It will be solved a bit more for paragraph styles (so, better reading of inherited quotes and whatnot) as well in the near future, but I think we'll have to stop short of getting the character data (bold, etc.) from paragraph styles inheriting from Normal.

The --input-filter thing does seem like an interesting idea, but I'm concerned about the need to invent another API/file format that we would have to parse. Could there be some well-specified format/API already?

By the way, @elliottslaughter, there's another option that might be easier, and more consistent with the rest of pandoc, than the api:

There's a list in readers/Docx.hs called divsToKeep. Right now, it just holds on to things for later transformation, but what if this could be specified with a command-line argument:

pandoc foo.docx --keep-parstyle="Snap" --keep-parstyle="Crackle" -o out.html

Haven't really thought that through, and the implications in the code are a bit fuzzy for me now, but that seems simpler. It involves knowing the name of your styles, but so would the fancier api version.

How about a flag --verbose-docx (modulo bike-shedding) which has the previous much noisier output? That seems like the simplest and most general solution to me.

I am hitting a similar issue due to the lack of proper support for captions in DOCX (see thread on the mailing list here). Having a --keep-parstyle option would greatly help to implement further conversions outside of pandoc (mapping to HTML figcaption or caption) (--keep-parstyle="Caption").

I think the problem here is that you could possibly be drowning in
unwanted verbosity. Docx is very verbose.

"If we had a keen vision and feeling of all ordinary human life, it
would be like hearing the grass grow and the squirrel's heart beat, and
we should die of that roar which lies on the other side of silence."

One potential pitfall with a pick-and-choose method compared to grab-all is that when pick-and-choose fails, it usually does so in an opaque way. For example, if you make a typo in a --keep-parstyle parameter, then you'll lose any formatting with that flag, and you won't know why just looking at the output. This only gets worse if the problem is not a typo but a result of Word doing things you didn't expect. Therefore, I would like to have some sort of --verbose-docx functionality, even if I have to spell it like --keep-parstyle='*', so that I can use it to discover what sort of formatting is present in the document for me to take advantage of.
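To make that failure mode concrete, here is a minimal Python sketch (not pandoc code) of a pick-and-choose pass over a pandoc-style JSON block list. It assumes the reader has already wrapped every styled paragraph in a Div whose classes are the style names. The function name `keep_styles`, the `"*"` wildcard handling, and the returned set of unmatched names are all hypothetical and only illustrate the idea of reporting a typo instead of silently dropping formatting.

```python
def keep_styles(blocks, wanted):
    """Unwrap Divs whose style class is not in `wanted`.

    `blocks` is a list of pandoc-style JSON blocks (top level and nested
    Divs only; other containers are not descended into in this sketch).
    Returns the filtered block list and the set of requested style names
    that never matched anything. The wildcard "*" keeps every Div.
    """
    keep_all = "*" in wanted
    seen = set()

    def walk(bs):
        out = []
        for b in bs:
            if b.get("t") == "Div":
                (ident, classes, kvs), inner = b["c"]
                inner = walk(inner)
                seen.update(classes)
                if keep_all or any(c in wanted for c in classes):
                    out.append({"t": "Div", "c": [[ident, classes, kvs], inner]})
                else:
                    out.extend(inner)  # drop the wrapper, keep the content
            else:
                out.append(b)
        return out

    result = walk(blocks)
    unmatched = set(wanted) - seen - {"*"}
    return result, unmatched
```

With something like this, asking for "Crakle" when the document only contains "Crackle" would come back in the unmatched set, and passing `{"*"}` would show every style that is actually present.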

The change itself is fairly simple. See the diff below for how it works,
minus the toggle and some details about metadata and lists. And I can
certainly see how this would allow the filter-savvy to go beyond the
capabilities of plain pandoc without changing normal usage. But here's a
question: a good deal of effort is put into figuring out what styles
_mean_ (through parsing inheritance in the style file, keeping track of
certain names that are used by convention, and a few other
tricks). Would you want all headers/blockquotes/lists to

a) also keep their divs in addition to transforming into pandoc
blocks?
b) transform into pandoc blocks like normal and only keep the verbosity
for ones that don't transform into pandoc (and would otherwise be lost)?
c) only be divs, and leave the transformation up to you (you don't want
to do that for lists)?

I'm sure there's some other options here too. Anyway, if you're set up
to build, try the diff below and see if it's sorta (again, quick and
dirty) like what you're looking for.

diff --git a/src/Text/Pandoc/Readers/Docx.hs b/src/Text/Pandoc/Readers/Docx.hs
index 64eb032..3c060fd 100644
--- a/src/Text/Pandoc/Readers/Docx.hs
+++ b/src/Text/Pandoc/Readers/Docx.hs
@@ -413,16 +413,15 @@ trimLineBreaks ils = ils

 parStyleToTransform :: ParagraphStyle -> (Blocks -> Blocks)
 parStyleToTransform pPr
-  | (c:cs) <- pStyle pPr
-  , c `elem` divsToKeep =
-    let pPr' = pPr { pStyle = cs }
-    in
-     (divWith ("", [c], [])) . (parStyleToTransform pPr')
   | (c:cs) <- pStyle pPr,
     c `elem` listParagraphDivs =
       let pPr' = pPr { pStyle = cs, indentation = Nothing}
       in
        (divWith ("", [c], [])) . (parStyleToTransform pPr')
+  | (c:cs) <- pStyle pPr =
+    let pPr' = pPr { pStyle = cs }
+    in
+     (divWith ("", [c], [])) . (parStyleToTransform pPr')
   | (_:cs) <- pStyle pPr
   , Just True <- pBlockQuote pPr =
     let pPr' = pPr { pStyle = cs }

Just a few comments to @mszep and @elliottslaughter regarding the use of HTML as an intermediate format, which isn't _quite_ dead yet! :-)

I have been using the HTML exported from LibreOffice as intermediate format for a long time, preprocessing it with Perl scripts using HTML::Tree, which isn't wholly dissimilar from how I use Data::Rmap to write pandoc filters. Note that I'm talking about the File -> Export... -> XHTML route, and _not_ File -> Save as... -> html which is almost as bad as MS Word 'HTML'. (At least that's what I think the menu items are called, since I can't install a non-localized version of LO!) The _former_ uses CSS/classes almost exclusively, albeit with inscrutable names, and what's left can easily be brought in line with HTML Tidy aka tidy.

The scripts basically do three things:

  1. Rename classes to something less annoying:

    • Spaces in builtin style names show up as _20_ in class names, so I do s/_20_/_/g on all class attributes.

    • Formatting which was applied with CTRL-I etc. becomes automatic styles, which show up with classes like .P1, .P2, .T1, .T2.

  2. Translate styles/classes into elements like _em_ and _strong_ which pandoc can interpret correctly.

    • The main annoyance is that too much is translated into divs/spans with classes, even emphasis and strong emphasis, so I have to translate them back!

    • Those automatic style numbered classes are assigned to combinations of formatting features semi-randomly, so that different (combinations of) formatting features correspond to different numbers for different documents. Because of this I have to write a custom config file for each document, which still beats having to do all the editing by hand!

  3. Transfer other classes/attributes to wrapping _div_ or _span_ elements as appropriate/needed.
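As a rough illustration of those three steps, here is a small Python sketch. It is not the Perl scripts described above: `CLASS_TO_ELEMENT`, `clean_class` and `translate` are hypothetical names, the mapping is an assumed per-document config, and step 3 is simplified to keeping the leftover classes on the element rather than moving them to a wrapper.

```python
# Hypothetical per-document config: which automatic classes mean what.
# The numbers differ from document to document, as described above.
CLASS_TO_ELEMENT = {"T1": "em", "T2": "strong"}


def clean_class(name):
    """Step 1: turn the _20_ that encodes a space into a plain underscore."""
    return name.replace("_20_", "_")


def translate(tag, classes, config=CLASS_TO_ELEMENT):
    """Steps 2 and 3 for one element.

    Returns (new_tag, remaining_classes). A class listed in `config`
    turns the element into that element (e.g. span -> em); all other
    classes are cleaned and kept.
    """
    new_tag = tag
    remaining = []
    for cls in classes:
        if cls in config:
            new_tag = config[cls]
        else:
            remaining.append(clean_class(cls))
    return new_tag, remaining
```

Keeping the mapping in a plain dictionary mirrors the "custom config file for each document" approach: only the table changes between documents, not the code.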

In the old days, before pandoc had attributes on elements other than code, and no builtin filter invocation for that matter, I had to inject code elements which were then converted to raw LaTeX or HTML as appropriate, using pandoc -t json input.md | perl filter.pl | pandoc -f json -t latex -o output.ltx. Nowadays I just assign attributes to a _div_ or _span_ element which tell my filter which raw markup snippets to inject at the beginning and end of the element's content list. As it happens, having your filter output things like

{"t":"Span","c":[["",[],[]],[
{"t":"RawInline","c":["latex","\\textsf{"]},
{"t":"Str","c":"with"},{"t":"Space","c":[]},{"t":"Str","c":"some"},{"t":"Space","c":[]},
{"t":"Emph","c":[{"t":"Str","c":"emphasis"}]},{"t":"Space","c":[]},{"t":"Str","c":"inside!"},
{"t":"RawInline","c":["latex","}"]}
]]}

and exporting directly to LaTeX actually overcomes the limitation that you cannot have Markdown inside LaTeX command arguments and environments. So from the HTML I produce markdown like

<span cmd=textsf>with some *emphasis* inside!</span>

and can get both working LaTeX and working HTML from the same source (which I of course use with documents written from scratch as well!):

{\textsf{with some \emph{emphasis} inside!}}

<span class=textsf>with some *emphasis* inside!</span>
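The LaTeX half of this trick can be sketched in a few lines of Python working on pandoc-style JSON. This is only an illustration under assumptions, not the _span2cmd.pl_ script: the function name `span_to_cmd` is hypothetical, the `cmd` attribute is taken from the example above, and the sketch leaves the Span's own attributes untouched and does not handle the HTML side.

```python
def span_to_cmd(inline):
    """Wrap the content of a Span carrying a `cmd` attribute in raw LaTeX.

    A Span with cmd=textsf gets a RawInline opening the command in front
    of its content and a RawInline closing brace after it. Any other
    inline is returned unchanged.
    """
    if inline.get("t") != "Span":
        return inline
    (ident, classes, kvs), content = inline["c"]
    cmd = dict(kvs).get("cmd")
    if cmd is None:
        return inline
    raw_open = {"t": "RawInline", "c": ["latex", "\\" + cmd + "{"]}
    raw_close = {"t": "RawInline", "c": ["latex", "}"]}
    return {"t": "Span",
            "c": [[ident, classes, kvs], [raw_open] + content + [raw_close]]}
```

Because the content between the two raw snippets stays ordinary pandoc inlines, emphasis and other Markdown inside the span is still converted by the LaTeX writer.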

I'm in the process of cleaning up, defragilizing, documenting and uploading my pandoc-related scripts and filters, and documenting the _span2cmd.pl_ and getting up the _lohtml2pdchtml.pl_, which needs a lot of cleanup and simplification, are next in line. We'll see what I can achieve in the weekend, since I'll be busy with RL both tonight and on Saturday.

@jkr: Perhaps using existing processing tools to clean up a data structure obtained from the docx XML might work for the docx reader too? In any case I guess a general tool for shaping up docx docs might be welcome outside the pandoc community as well!

@jkr Your diff would be sufficient for my needs. It exposes enough information, and it actually isn't _that_ verbose compared to some of the HTML I was looking at from the same source. If something along these lines makes it into the next release, I'll be satisfied.

@jkr Actually, I lied. Turns out that the document has inline styles as well. This ends up being a double-whammy because the style in question is a template used for inline code, and in addition to not getting the style itself, I'm also missing some leading spaces used as indentation. Is there any way the approach could be tweaked to handle this as well?

Does anyone want to summarize this issue in light of current pandoc? Are changes still desirable? What would need to change? Or can this be closed?

The issue is essentially the same: right now we go through a lot of trouble to figure out what paragraph and character styles mean, and if we can't figure it out (and it's not one that we want to keep) we throw it out. The initial suggestion here would be to have a toggle to keep everything for the sake of filtering. This would produce next-to-unreadable output, but would be useful as part of a filter pipeline. Questions that would have to be answered:

  1. Would this be in addition to interpretation or instead of interpretation? I.e., right now we go through a lot of trouble, checking property inheritance, to figure out whether "MyFancyLongQuote" is a blockquote or not. Would we add a div to our interpretation, or use a div instead of an interpretation?

  2. What would be the use case? People who know the structure well already and want to write a filter for it? People who want to inspect the structure? Would this be for reusable filters based on a template or one-and-done filters (say, for converting a long manuscript)?

  3. Would these actually be reusable? Internal class names can change based on whose version of Word last saved it.

My take: I'm quite sympathetic to the desire to hang on to structure; it feels weird to throw away information that could be used. But, in general, I don't know if pandoc is the right vehicle for this. The proposal would be, essentially, to produce an XML dialect (using divs and spans) with markdown embedded in the text content. That's something pandoc could do, but I'm not sure if it's something it should do.

Ultimately, this sounds more like something that would be best served by a standalone docx library which could be as lossless as we want. And I'd like to do that. But, realistically, if that's waiting on me, it probably isn't going to happen for another year or so.

I agree with @jkr's assessment, and I'm going to close this.

I expect this is closed and long-dead, but just to add, following this discussion ...

https://groups.google.com/forum/#!msg/pandoc-discuss/09NoQebno9c/_3c4v7BICAAJ

It's a shame the issue was closed; this basically prevents me from using Pandoc. We have a lot of custom styles, introduced through company branding. Right now, with Pandoc, these are lost.

Worth noting that Mammoth allows the user to specify a style map where Word styles are mapped to HTML (with the ability to specify a class). That said, Mammoth doesn't support font mapping.

I'd even be happy with all divs making it through, as it's pretty easy to strip these out using Python's Beautiful Soup.

Well, perhaps I was premature in closing this. I'll reopen it.
Perhaps it's worth considering a --verbose-docx flag for use in filters (or with HTML where you can use CSS). To be useful I think it would need to handle both div and span level styles.

I for my part would be perfectly content with a +custom_styles extension which would cause the docx reader to wrap paragraphs and runs with a custom paragraph or character style in divs/spans with a custom-style attribute, i.e. the 'reverse' of custom styles in docx output.

I would be happy to have the --verbose-docx and then just use a custom filter to do my own filtering.
As it is, pandoc completely removes crucial structure from my .docx files that I have no way of retaining. It's actually a very frustrating situation made even worse.

@ashnur -- the actual fix should remain pretty easy, but I'm afraid people have different ideas of what they might want the "verbose" output to look like. (Should it interpret dependent styles or just leave them in divs.) In any case, it's mainly just deleting a few lines from Docx.hs.

If you post a docx file that has the structure you're describing, I'd be happy to post some verbose output and see if it fits with what you and others want.

(I do hope that pandoc isn't really making your frustrating situations even worse, though. At worst we should be leaving them just as frustrating as they were before.)

Pandoc should definitely parse the docx to a somewhat simplified representation (otherwise you could just unzip the docx yourself and use an XML parser). But I quite liked @bpj's suggestion above. Wouldn't that cover most use-cases?

Maybe -- and, as I said, it would be, on the whole, quite easy. For the divs, for example, just switch out parStyleToTransform in Docx.hs with something like:

~~~haskell
parStyleToTransform :: ParagraphStyle -> (Blocks -> Blocks)
parStyleToTransform pPr =
  if null (pStyle pPr)
  then id
  else divWith ("", pStyle pPr, [])
~~~

or (using custom-styles)

~~~haskell
parStyleToTransform :: ParagraphStyle -> (Blocks -> Blocks)
parStyleToTransform pPr =
  if null (pStyle pPr)
  then id
  -- one ("custom-style", name) pair per style name on the paragraph
  else divWith ("", [], map (\s -> ("custom-style", s)) (pStyle pPr))
~~~

There might be a few more details, but that's pretty much all there is to it. The question is whether the output from this would really be what people want. It wouldn't just add verbosity -- it would also no longer figure out dependent styles.

Or do we say that we do interpret dependent styles, and only use the divs as a fallback. In that case, we might be losing style names that some of these folks might want. In that case we use the above code as a fallback at the end of the parPropsToTransform, instead of the id fallback there currently.

Or both? Interpret and wrap?

All of the above also goes for code blocks and a few other cases.

All of this is fairly trivial to implement. But the questions above are why I'd like to see some real-world files. To make sure that people want the same thing before we go about implementing it.

pandoc is helping me be less frustrated :) sorry for being ambiguous, there is other stuff that is frustrating and i got yet again blocked by something unexpected, so currently i am exploring alternative ways to parse .docx. to be honest, i was way too optimistic when i first considered using pandoc to parse .docx, i would not have guessed that just by parsing it i can lose so much structure, like a wrapping ordered list over the whole content. :)

Well, we shouldn't be losing ordered lists, in any case. I'd be curious to see the document, though I understand if you can't post it for business or personal reasons.

4298 _(how to waste an hour from your life because you want to be nice, get called on your tone, get no help :) )_

@jgm on this larger issue:

  1. any thoughts on wrapping resolved styles vs leaving them unresolved? My preference would be to resolve them, and then wrap them in a div, since all of the style dependency info would be lost after conversion. Seems the best of both worlds.

  2. Thoughts on the flag name/Options.hs field? People had suggested --verbose-docx but this seems to step on --verbose a bit, and it occurs to me that this could potentially be used in other readers too. --read-all or something like that?

See #4299

> it would also no longer figure out dependent styles.

@jkr what do you mean by dependent style here? As I understand it, when you use a named style through custom-style=Foo you just say "Associate the custom style 'Foo' with the paragraphs inside this div/the characters inside this span", and if your reference docx doesn't already define that style, Pandoc creates a stub style with no ancestors other than the default style, which you then have to modify, including making it dependent on other style(s). There is no info on dependencies among these styles in the Markdown document I send to Pandoc, or is there? My suggestion, which @mb21 supported, was just to do that "in reverse": if a para or run of characters in the input docx is associated with a named style "Foo", wrap it in a div/span with custom-style="Foo". If style dependencies enter the picture here I'd be happy to learn how, as it probably means I've missed something useful!

BTW apologies for giving myself a thumbs-up -- a failed attempt to find out who had given their support while browsing on my tablet. I'm clearly far too confirmation hungry! :-/

BTW I personally would need this functionality with spans/character styles at least as much as with divs/paragraph styles.

Styles are built on top of other styles ("OurFancyHeader" is built on top of "Header1", for example). Right now the reader will figure that out, and make "OurFancyHeader" into a level-1 header. We can keep the style name and put it around a header, or keep it and put it around plain text, or not keep it because we figured out what we meant. In any case, the only difference is whether it is the name of the span/div class, or whether it's a kv "custom-style" value.
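A toy Python sketch of the "resolve and also keep the name" option may help pin down the terminology. It is not how the Haskell reader is written: the dictionaries are simplified stand-ins rather than real pandoc JSON, and `resolve_heading_level`, `interpret_and_wrap`, `based_on` and `builtin_levels` are hypothetical names. The style names follow the example above.

```python
def resolve_heading_level(style, based_on, builtin_levels):
    """Follow the based-on chain from `style` until a known heading is found.

    Returns the heading level, or None if the chain ends (or loops)
    without reaching a style listed in `builtin_levels`.
    """
    seen = set()
    while style is not None and style not in seen:
        if style in builtin_levels:
            return builtin_levels[style]
        seen.add(style)
        style = based_on.get(style)
    return None


def interpret_and_wrap(style, text, based_on, builtin_levels):
    """Resolve the style AND keep its name as a custom-style value."""
    level = resolve_heading_level(style, based_on, builtin_levels)
    if level is not None:
        inner = {"t": "Header", "level": level, "text": text}
    else:
        inner = {"t": "Para", "text": text}
    return {"t": "Div", "custom-style": style, "content": [inner]}
```

For instance, with `based_on = {"OurFancyHeader": "Header1"}` and `builtin_levels = {"Header1": 1}`, a paragraph styled "OurFancyHeader" would come out as a level-1 header wrapped in a div that still carries the original style name.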

Anyway, if you're able to build pandoc dev versions, you can take a look at the version I put up yesterday at #4299 and see if it does something like what you want it to do. It works with both char/spans and par/divs. (My sense is that people are using quite different terminology, so until folks actually try it on some documents, I'm not sure there will be agreement over whether it satisfies their needs.)

> There is info on dependencies among these styles in the Markdown document

should have been "There is no info on dependencies among these styles in the Markdown document"

> Anyway, if you're able to build pandoc dev versions, you can take a look at the version I put up yesterday at #4299 and see if it does something like what you want it to do.

I've had variable success with building from dev in the past. I'll give it a try if I get some time in the weekend.

> In any case, the only difference is whether it is the name of the span/div class, or whether it's a kv "custom-style" value.

I would very much prefer kv "custom-style" value as I frequently convert docs back and forth between docx/md when collaborating with people using Word or LibreOffice and its ilk.

Perhaps the --verbose-docx or whatever option could take an argument class or kv? People like me and people converting docx to HTML for example clearly have different needs here.

kv "custom-style" seems fine to me, and I can see how the round-tripping would make it a better choice. If you post a docx that requires this (i.e. where valuable info is lost in the current version), I could run the dev version on it and post the output for you.

It occurred to me that with a filter it's easier to convert a kv "custom-style" into a class than the other way around because you would need to know which class to use by using some fragile rule-of-thumb like the first/last/nth class, which easily breaks inadvertently if you add a class. If you convert the other way around you just take the value of the "custom-style" key and append it to the list of classes.

Unfortunately I don't have any real file which I'm allowed to share.

A Lua filter for appending the value of kv custom-style as a class and optionally removing the kv by setting the metadata kv remove-custom-style: true or running pandoc with -M remove-custom-style. Not much of a contribution but it is what I can do ATM. I also started this page on defining/modifying styles in LibreOffice which hopefully amounts to something.

````lua
--[[

custom-style2class.lua -- Pandoc filter to add custom-style values as classes.

This is useful when you produce both DOCX and HTML from the same source
and use custom-style key--value to apply DOCX custom styles,
but want the same available to CSS as a class. It simply takes the
value of the custom-style key and appends it to the list of classes.

To also remove the custom-style kv set the metadata kv remove-custom-style: true,
or run pandoc with:

pandoc --lua-filter=custom-style2class.lua -M remove-custom-style [OPTIONS] [FILENAME]

]]

-- Set from the document metadata by get_meta, read by custom_style2class.
local remove_style

function get_meta (meta)
  remove_style = meta['remove-custom-style']
  return meta
end

function custom_style2class (elem)
  local style = elem['attributes']['custom-style']
  -- Returning nothing leaves the element unchanged.
  if not style then return end
  table.insert(elem.classes, style)
  if remove_style then
    elem['attributes']['custom-style'] = nil
  end
  return elem
end

-- Two passes: read the metadata first, then rewrite divs and spans.
return {
  { Meta = get_meta },
  { Div = custom_style2class, Span = custom_style2class, },
}
````

@bpj -- I'm convinced by the argument for custom-style kvs. (Or, more generally, I'm convinced that it should be able to roundtrip back to docx, using the original docx as a reference.docx.) Right now, this is waiting on agreement over what form the option should take (option flag, reader extension). I'm fairly agnostic on that front, so I'm going to defer to others, but I imagine this will be implemented pretty soon.

> I'm convinced that it should be able to roundtrip back to docx, using the original docx as a reference.docx.

@jkr -- that's exactly what I need/want to do.

Great! When will this be released, approximately? Days, weeks, months? I'm not keen on building from source myself.

Not sure -- but you can always play (at your own risk) with nightly builds:

https://github.com/pandoc-extras/pandoc-nightly/

@bpj This was released on Mar 3 (v.2.1.2)

@agusmba Thanks I know! :+1:
