Pandoc: docx to html hyperlink bug

Created on 7 Nov 2018 · 25 Comments · Source: jgm/pandoc

C:\Users\i\Downloads\pandoc-2.3.1-windows-i386\pandoc.exe -s a.docx -o example35.html
test file a.docx

The text with a hyperlink in the docx file is always wrapped in a <span class="underline"> in the converted HTML file, which serves no purpose and instead causes problems.
BTW, the bug also occurs when converting docx to Markdown.

Docx reader

All 25 comments

Well yes...

$ pandoc -t native /Users/maurobieg/Downloads/a.docx 
[Para [Link ("",[],[]) [Span ("",["underline"],[]) [Str "pandoc"]] ("https://github.com/jgm/pandoc",""),Space,Str "is",Space,Str "good"]]

The question is: how did the underline get into the Word file? How did you create the link in the file?

The text is underlined in the Word file, so in that sense pandoc's conversion is correct.
But it seems to be Word's default to add underlining when you insert a hyperlink, so a case could be made for ignoring this for links. @jkr any thoughts?

Just my 2c.

I know it is the default to have links underlined in Word, but since Word can change styles (in fact my pandoc-word template has links without underlines), I would try not to be too smart about this.

If the original is underlined we could assume the intention is to keep the underline (otherwise, shouldn't it be removed from the original?)

This is something that, for the most part, we already do. We make an effort not to pick up the underlining if it's the result of a hyperlink style (as opposed to explicitly being underlined). Our general approach is specifically to blacklist the style named "Hyperlink" when we're figuring out style formatting. (See blacklistedCharStyles in Docx.hs).

Unfortunately, the docx in the bug report has a custom hyperlink style (a7). In the styles.xml file, this style is listed with the name "Hyperlink", so we should be able to detect it by checking the style's name and inheritance, which would make the blacklisting more robust. It should be a fairly straightforward fix. I'll take a look.

Actually, this is sort of a judgment call. The problem is that @redstoneleo used a custom hyperlink style (a7) but it has a <w:name val="Hyperlink"/> in word/styles.xml. According to this page, the w:name element

Specifies the primary name for the style, which can be used in the user interface. The name is stored in the val attribute. Note that alternate names will be used by the user interface if specified by the aliases element (and if the appropriate value is set in the stylePaneFormatFilter element within the settings part).

So on the one hand it doesn't mean anything internally (it could be anything, and still work the same), but on the other hand it is what is exposed to the user through the UI (based on someone else's fiddling, perhaps). My inclination is to try to honor it, but it won't be an immediate fix, since the RunStyle type currently ignores that field.

@jgm (and all other interested parties): what do you think?

ping @lierdakil

I'm with @agusmba on this -- pandoc shouldn't try to be too smart about figuring out styles. Being too smart creates problems more often than it solves them, and this particular issue is easily rectified by a 5-line Lua filter:

function Span(el)
  if el.classes[1] == "underline" then
    return el.content
  end
end

As a general observation, guessing intent from display name sounds like an all-around bad idea. Consider potential i18n woes, for instance.

That said, I have a vague recollection that Word might not actually follow the spec, and instead uses w:name as some sort of internal "invariant" name, and the actual style identifier as a mangled display name, or something to that effect. The last time I dived into this mess was about 3 years ago though, and the details are very fuzzy, so don't trust me on this. If anything, this is a caution that you shouldn't really trust anything written in the spec, because in my experience Word sometimes does the outright opposite of what's written there.

When I opened a.docx, added my own hyperlink, and saved the result, Word replaced the old a7 id with w:styleId="Hyperlink", which it used for both the existing link and my new one. This suggests to me that maybe the w:name is being used as an internal invariant.

So I think it would be reasonable to treat this like any other hyperlink, but I don't think it's hugely important, because it's easy to get rid of the unwanted span if you need to.
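For anyone who wants to drop the span only inside links (the Lua filter earlier in the thread unwraps every underline span in the document), here is a minimal Python sketch of a JSON filter. It assumes the AST layout that `pandoc -t json` emits, where each node is an object with a `t` tag and a `c` payload. The function names are made up for illustration, and it only looks at spans sitting directly in a link's inline list, not ones nested inside other inlines.

```python
import json

def unwrap_underline(inlines):
    """Splice the contents of underline Spans into the surrounding inline list."""
    out = []
    for node in inlines:
        is_underline_span = (
            isinstance(node, dict)
            and node.get("t") == "Span"
            # Span payload is [attr, inlines]; attr is [id, classes, key-values]
            and "underline" in node["c"][0][1]
        )
        if is_underline_span:
            out.extend(unwrap_underline(node["c"][1]))
        else:
            out.append(node)
    return out

def walk(node):
    """Recursively copy the JSON AST, rewriting the inline list of every Link."""
    if isinstance(node, list):
        return [walk(item) for item in node]
    if isinstance(node, dict):
        if node.get("t") == "Link":
            # Link payload is [attr, inlines, target]
            attr, inlines, target = node["c"]
            return {"t": "Link", "c": [attr, walk(unwrap_underline(inlines)), target]}
        return {key: walk(value) for key, value in node.items()}
    return node

def filter_stream(src, dst):
    """Read a pandoc JSON document from src and write the rewritten one to dst."""
    json.dump(walk(json.load(src)), dst)
```

Saved as a script (say, a hypothetical strip_link_underline.py that calls `filter_stream(sys.stdin, sys.stdout)`), it could sit in a pipe such as `pandoc -t json a.docx | python strip_link_underline.py | pandoc -f json -t markdown`. Underlined text outside links is left alone.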

Thanks everyone!
I agree that, at the least, a case should be made for ignoring this for links.
Note that docx to Markdown suffers from the same issue; the converted Markdown text is

[[pandoc]{.underline}](https://github.com/jgm/pandoc) is good

while many Markdown editors don't render this well.

I guess this behavior is not what most users want, so it would be better to hide it by default. If some users actually want it, pandoc could give them a switch to turn it on (a pandoc feature yet to be implemented).

while many Markdown editors don't render this well.

This is not really relevant to the discussion, but "many Markdown editors" don't handle pandoc-flavoured markdown in the first place. If you want compatibility, output to commonmark instead (i.e. pandoc -t commonmark a.docx), or perhaps even commonmark-raw_html to suppress spurious spans and divs.

Thanks @jkr for the clarification. I can see that in general it is a good practice to blacklist the default Hyperlink style.
I'm guessing then that if a user wants specific formatting for their links, they should use a different custom style (no idea how to apply this automatically in Word).

In my own case, I have modified the "Hyperlink" and "Visited hyperlink" styles to remove underlines, so actually I'd not suffer from this "blacklisting". However, for the sake of understanding better, if a user wanted to have "bold" hyperlinks transferred to markdown, he wouldn't be able to do it just by modifying the Hyperlink style, is that right? (not saying this is wrong, just trying to understand how one would go about doing it)

Sorry if this feels like a sidetrack for the discussion but it could be relevant in order to decide when we want to ignore the Hyperlink style and when we should try to include it in the transformations.

Doing some more tests locally:

I created a new Word document using the default blank page template.

$ pandoc -t native build/pandoc-5052-default.docx
[Para [Str "This",Space,Str "is",Space,Str "normal",Space,Str "text"]
,Para [Str "This",Space,Str "is",Space,Str "a",Space,Link ("",[],[]) [Span ("",["underline"],[]) [Str "link",Space,Str "to",Space,Str "pandoc"]] ("https://github.com/jgm/pandoc/issues/5052","")]]

I see the underline. I expected it to be blacklisted and removed; maybe this has something to do with my having a Spanish locale? (Word shows localized style names.)

I modified the hyperlink (hipervínculo) style to remove the underline:

$ pandoc -t native build/pandoc-5052-hyperlink-not-underlined.docx
[Para [Str "This",Space,Str "is",Space,Str "normal",Space,Str "text"]
,Para [Str "This",Space,Str "is",Space,Str "a",Space,Link ("",[],[]) [Str "link",Space,Str "to",Space,Str "pandoc"] ("https://github.com/jgm/pandoc/issues/5052","")]]

This works as expected.

Lastly, I modified the hyperlink (hipervínculo) style to add "bold":

$ pandoc -t native build/pandoc-5052-hyperlink-not-underlined-bold.docx
[Para [Str "This",Space,Str "is",Space,Str "normal",Space,Str "text"]
,Para [Str "This",Space,Str "is",Space,Str "a",Space,Link ("",[],[]) [Strong [Str "link",Space,Str "to",Space,Str "pandoc"]] ("https://github.com/jgm/pandoc/issues/5052","")]]

Again this works as expected, but I'm not sure whether that is because my locale is not English, since it seems the hyperlink style is not being blacklisted anyway.

pandoc-5052-agusmba-tests.zip

I'm a bit torn between having pandoc work "properly" for the 90% of users who do not care about the default underline on links when converting to other formats, and the probably 1-2% (?) who actually do want the underline to be transferred to other formats. I assume the remaining 8-9% have tweaked the hyperlink style to remove the underline :wink:

@agusmba, I'm in a bit of a rush (lately, perpetual rush, that is), so it'd be great if you could check your docx style XML (by unzipping it and looking at word/styles.xml -- you will likely want to prettify/reformat it first) to see what the styleId on the hyperlink style is and whether it has w:name val="Hyperlink". From my fuzzy recollections, styleId in international Word versions is incomprehensible (kinda like a7 or p10) and can change arbitrarily, while the meaningful semantic information is carried by w:name (which Word then happily uses to find its base styles in styles.xml, changing styleId to whatever it feels like when the document is opened in a different Word version). This would at least answer the question of "do international Word versions mangle styleId", and if they do, it would be a good argument in favour of honouring w:name.
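A quick way to run that check without unzipping by hand is sketched below. It relies only on a docx being a zip archive that contains word/styles.xml, as described above. The function name `list_styles` is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in ElementTree's "{uri}" prefix form
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def list_styles(docx):
    """Return (type, styleId, name) for every w:style in word/styles.xml.

    `docx` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/styles.xml"))
    rows = []
    for style in root.iter(W + "style"):
        name_el = style.find(W + "name")
        name = name_el.get(W + "val") if name_el is not None else None
        rows.append((style.get(W + "type"), style.get(W + "styleId"), name))
    return rows

# Example use (the path is whatever docx you want to inspect):
#   for row in list_styles("a.docx"):
#       print(row)
```

Comparing the styleId and name columns across documents saved by different Word locales shows directly whether the id or the name is the stable one.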

Sure thing!

In the default docx:

<w:style w:type="character" w:styleId="Hipervnculo">
    <w:name w:val="Hyperlink"/>
    <w:basedOn w:val="Fuentedeprrafopredeter"/>
    <w:uiPriority w:val="99"/>
    <w:unhideWhenUsed/>
    <w:rsid w:val="00143BD0"/>
    <w:rPr>
        <w:color w:val="0563C1" w:themeColor="hyperlink"/>
        <w:u w:val="single"/>
    </w:rPr>
</w:style>

In the not-underlined docx:

<w:style w:type="character" w:styleId="Hipervnculo">
    <w:name w:val="Hyperlink"/>
    <w:basedOn w:val="Fuentedeprrafopredeter"/>
    <w:uiPriority w:val="99"/>
    <w:unhideWhenUsed/>
    <w:rsid w:val="00E319DB"/>
    <w:rPr>
        <w:color w:val="0563C1" w:themeColor="hyperlink"/>
        <w:u w:val="none"/>
    </w:rPr>
</w:style>

So yeah, it seems the w:name val is consistent.

Right. So I conjecture that styleId is only incomprehensible in locales with non-ASCII alphabets (e.g. Ukrainian, Hindi, Russian, Japanese, etc.; perhaps also German and other languages using extended Latin). In any case, it's locale-dependent, while name is apparently not. So yeah, it would appear that making pandoc match on name instead of styleId is a good idea (optionally we can still fall back on styleId, but I'm not entirely sure we need to -- more research is needed to decide).

Thanks, @agusmba.
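To make the "match on name instead of styleId" idea concrete, here is a minimal Python sketch. It is not pandoc's actual implementation (the reader is written in Haskell), and the function names, the `ignored_names` parameter, and the choice to follow w:basedOn chains are all assumptions made for illustration. The sample uses the a7 and Hipervnculo ids seen in this thread; the other ids are invented.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in ElementTree's "{uri}" prefix form
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Trimmed-down styles.xml. "a7" and "Hipervnculo" come from the thread;
# "a0" and "MyLink" are hypothetical ids added for illustration.
SAMPLE_STYLES = """\
<w:styles xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:style w:type="character" w:styleId="a0">
    <w:name w:val="Default Paragraph Font"/>
  </w:style>
  <w:style w:type="character" w:styleId="a7">
    <w:name w:val="Hyperlink"/>
    <w:basedOn w:val="a0"/>
  </w:style>
  <w:style w:type="character" w:styleId="Hipervnculo">
    <w:name w:val="Hyperlink"/>
  </w:style>
  <w:style w:type="character" w:styleId="MyLink">
    <w:name w:val="My Link"/>
    <w:basedOn w:val="a7"/>
  </w:style>
</w:styles>
"""

def load_styles(styles_xml):
    """Map each styleId to a (name, basedOn) pair; missing parts are None."""
    root = ET.fromstring(styles_xml)
    styles = {}
    for style in root.iter(W + "style"):
        name_el = style.find(W + "name")
        based_el = style.find(W + "basedOn")
        styles[style.get(W + "styleId")] = (
            name_el.get(W + "val") if name_el is not None else None,
            based_el.get(W + "val") if based_el is not None else None,
        )
    return styles

def is_hyperlink_style(style_id, styles, ignored_names=("Hyperlink",)):
    """True if the style, or any style it is based on, has an ignored w:name."""
    seen = set()  # guards against cyclic basedOn references
    while style_id in styles and style_id not in seen:
        seen.add(style_id)
        name, based_on = styles[style_id]
        if name in ignored_names:
            return True
        style_id = based_on
    return False
```

With `styles = load_styles(SAMPLE_STYLES)`, both `a7` and `Hipervnculo` are recognised as hyperlink styles regardless of their locale-dependent ids, while `a0` is not. Whether a derived style such as `MyLink` should also be ignored is a design choice; dropping the `basedOn` loop would restrict the match to the named style itself.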

No problem @lierdakil, I know the feeling of perpetual rush.

I'd like to ask though, if we fix this issue by properly blacklisting "Hyperlink" styles in other Word languages, would it make sense to have the option of not blacklisting it, in case a user would like his custom hyperlink style modifications to carry on to other formats?

I'm not actually requesting it, since personally I'm fine with stripping the hyperlink styling when reading from docx and having the liberty to add whatever link style I want on the writer side (templates, custom styles on reference doc, Lua filters...). Just asking, since fixing this could "break" backwards compatibility for internationalized versions of Word (even if that was due to not applying the blacklisting rule consistently).

if we fix this issue by properly blacklisting "Hyperlink" styles in other Word languages, would it make sense to have the option of not blacklisting it

From where I'm standing, it's more a question of consistency at this point. If pandoc is stripping hyperlink styles from en_us documents, but not from documents produced with other locales, I'd say it's a problem.

If we decide that Word's hyperlink style carries semantic, and not purely cosmetic information, we kinda have to ignore the style details. Besides, I'd say it makes much more sense to apply styles to hyperlinks "in bulk" (via css and whatnot) rather than adding spans to each hyperlink individually.

That said, a more flexible solution might be to just make blacklisted styles configurable via reader options, with the default being blacklisting hyperlink style.

Another option would be only stripping underline from hyperlinks, but that sounds a bit too specific to be useful in general.

I completely agree with your first two points.

That said, a more flexible solution might be to just make blacklisted styles configurable via reader options, with the default being blacklisting hyperlink style.

This would be nice, although not a must-have in order to fix this issue (depending on the amount of work involved, maybe let it have its own issue?)

Another option would be only stripping underline from hyperlinks, but that sounds a bit too specific to be useful in general.

Yeah, I wouldn't do this either.

Nikolay Yakimov notifications@github.com writes:

From where I'm standing, it's more a question of consistency at this point. If pandoc is stripping hyperlink styles from en_us documents, but not from documents produced with other locales, I'd say it's a problem.

I agree with this. We should be matching on the name rather than the id, which can vary unpredictably.

A few additional notes.

After taking a quick look at some of the docx files I had on hand, custom user-defined styles have their actual display name in name, while Word's "builtin" styles have name set consistently in English. styleId values seem to be completely arbitrary -- not even necessarily a localized name or anything of the sort.

I've also noticed that hyperlinks aren't the only place where the docx reader relies on style identifiers, by the way; there are actually quite a few places where it does. All of those are potentially subject to this issue (I say potentially because, as far as I can tell, Word won't touch custom styles).

Huh. I knew this all sounded vaguely familiar. I fixed something very similar to this for the docx writer a while back, apparently, in #1716 and #1968 -- thought this might be worth referencing for context.

This issue is fixed by #5732, and hence can be closed:

$ pandoc -f docx -t markdown a.docx
[pandoc](https://github.com/jgm/pandoc) is good
$ pandoc -f docx -t native a.docx
[Para [Link ("",[],[]) [Str "pandoc"] ("https://github.com/jgm/pandoc",""),Space,Str "is",Space,Str "good"]]