Pandoc: <strong> tag not closed correctly or invented 😳

Created on 9 Jul 2020  ·  14 Comments  ·  Source: jgm/pandoc

I noticed that sometimes the <strong> tag is not closed in the correct place, or is inserted where it does not exist. 😳
It may also happen with other tags (e.g. <em>), but so far I have noticed this behavior only with <strong>.

The problem occurs with both version 2.9.2.1 and the 2.10 build from @lierdakil's branch (#6511), which handles "negative inline formatting".

The anomalous behavior occurs with both the "-f docx" and "-f docx+styles" options.

I am attaching a docx where the problem arises.
(The file was saved in LibreOffice but the original, coming from Word, also had the same problem)

bug_strong

_Result_

I. Cantico dei Cantici, di Salomone. 1 Mi baci con i baci della sua bocca! Sì migliore del vino è il tuo amore. Inebrianti sono i tuoi profumi per la fragranza, Aroma che si spande è il tuo nome: per questo le ragazze di te si innamorano. Trascinami con te, corriamo!

<strong> is closed after "fragranza," instead of "bocca!"

The Bride's Dream. 1 On my bed by night I sought him whom my soul loves; I sought him, but found him not. I will rise now and go about the city, in the streets and in the squares; I will seek him whom my soul loves.

A nonexistent <strong> is created around "I will rise [...] loves."

Thank you

Files:
bug_strong_not_closed_or_not_exist.docx

Docx reader

All 14 comments

when converting to HTML? Sounds like a bug in the HTML writer then... can you do -t native and find the troublesome part of the document and post it here?

(https://pandoc.org/MANUAL.html#description for background on native format)

> when converting to HTML? Sounds like a bug in the HTML writer then... can you do -t native and find the troublesome part of the document and post it here?
>
> (https://pandoc.org/MANUAL.html#description for background on native format)

I don't think so because the problem also occurs with markdown

Now try -t native ...

-f docx -t markdown

***[I. Cantico dei Cantici, di Salomone.]{.ul}*** **[1]{.ul}** Mi **baci
con i baci della sua bocca! Sì migliore del vino è il tuo amore.
Inebrianti sono i tuoi profumi per la fragranza,** Aroma che si spande è
il tuo nome: per questo le ragazze di te si innamorano. Trascinami con
te, corriamo!

***[The Bride\'s Dream.]{.ul}*** **[1]{.ul}** On my bed by night **I
sought him whom my soul loves;** I sought him, but found him not. **I
will rise now and go about the city, in the streets and in the squares;
I will seek him whom my soul loves.**

-f docx -t native

[Para [Emph [Strong [Underline [Str "I.",Space,Str "Cantico",Space,Str "dei",Space,Str "Cantici,",Space,Str "di",Space,Str "Salomone."]]],Space,Strong [Underline [Str "1"]],Space,Str "Mi",Space,Strong [Str "baci",Space,Str "con",Space,Str "i",Space,Str "baci",Space,Str "della",Space,Str "sua",Space,Str "bocca!",Space,Str "S\236",Space,Str "migliore",Space,Str "del",Space,Str "vino",Space,Str "\232",Space,Str "il",Space,Str "tuo",Space,Str "amore.",Space,Str "Inebrianti",Space,Str "sono",Space,Str "i",Space,Str "tuoi",Space,Str "profumi",Space,Str "per",Space,Str "la",Space,Str "fragranza,"],Space,Str "Aroma",Space,Str "che",Space,Str "si",Space,Str "spande",Space,Str "\232",Space,Str "il",Space,Str "tuo",Space,Str "nome:",Space,Str "per",Space,Str "questo",Space,Str "le",Space,Str "ragazze",Space,Str "di",Space,Str "te",Space,Str "si",Space,Str "innamorano.",Space,Str "Trascinami",Space,Str "con",Space,Str "te,",Space,Str "corriamo!"]
,Para [Emph [Strong [Underline [Str "The",Space,Str "Bride's",Space,Str "Dream."]]],Space,Strong [Underline [Str "1"]],Space,Str "On",Space,Str "my",Space,Str "bed",Space,Str "by",Space,Str "night",Space,Strong [Str "I",Space,Str "sought",Space,Str "him",Space,Str "whom",Space,Str "my",Space,Str "soul",Space,Str "loves;"],Space,Str "I",Space,Str "sought",Space,Str "him,",Space,Str "but",Space,Str "found",Space,Str "him",Space,Str "not.",Space,Strong [Str "I",Space,Str "will",Space,Str "rise",Space,Str "now",Space,Str "and",Space,Str "go",Space,Str "about",Space,Str "the",Space,Str "city,",Space,Str "in",Space,Str "the",Space,Str "streets",Space,Str "and",Space,Str "in",Space,Str "the",Space,Str "squares;",Space,Str "I",Space,Str "will",Space,Str "seek",Space,Str "him",Space,Str "whom",Space,Str "my",Space,Str "soul",Space,Str "loves."]]]

-f docx+styles -t native

[Div ("",[],[("custom-style","testo commento")])
 [Para [Span ("",[],[("custom-style","testo tit")]) [Str "I.",Space,Str "Cantico",Space,Str "dei",Space,Str "Cantici,",Space,Str "di",Space,Str "Salomone."],Space,Span ("",[],[("custom-style","numero_glossa")]) [Str "1"],Space,Str "Mi",Space,Strong [Str "baci",Space,Str "con",Space,Str "i",Space,Str "baci",Space,Str "della",Space,Str "sua",Space,Str "bocca!",Space,Str "S\236",Space,Str "migliore",Space,Str "del",Space,Str "vino",Space,Str "\232",Space,Str "il",Space,Str "tuo",Space,Str "amore.",Space,Str "Inebrianti",Space,Str "sono",Space,Str "i",Space,Str "tuoi",Space,Str "profumi",Space,Str "per",Space,Str "la",Space,Str "fragranza,"],Space,Str "Aroma",Space,Str "che",Space,Str "si",Space,Str "spande",Space,Str "\232",Space,Str "il",Space,Str "tuo",Space,Str "nome:",Space,Str "per",Space,Str "questo",Space,Str "le",Space,Str "ragazze",Space,Str "di",Space,Str "te",Space,Str "si",Space,Str "innamorano.",Space,Str "Trascinami",Space,Str "con",Space,Str "te,",Space,Str "corriamo!"]]
,Div ("",[],[("custom-style","testo commento")])
 [Para [Span ("",[],[("custom-style","testo tit")]) [Str "The",Space,Str "Bride's",Space,Str "Dream."],Space,Span ("",[],[("custom-style","numero_glossa Carattere")]) [Str "1"],Space,Str "On",Space,Str "my",Space,Str "bed",Space,Str "by",Space,Str "night",Space,Strong [Str "I",Space,Str "sought",Space,Str "him",Space,Str "whom",Space,Str "my",Space,Str "soul",Space,Str "loves;"],Space,Str "I",Space,Str "sought",Space,Str "him,",Space,Str "but",Space,Str "found",Space,Str "him",Space,Str "not.",Space,Strong [Str "I",Space,Str "will",Space,Str "rise",Space,Str "now",Space,Str "and",Space,Str "go",Space,Str "about",Space,Str "the",Space,Str "city,",Space,Str "in",Space,Str "the",Space,Str "streets",Space,Str "and",Space,Str "in",Space,Str "the",Space,Str "squares;",Space,Str "I",Space,Str "will",Space,Str "seek",Space,Str "him",Space,Str "whom",Space,Str "my",Space,Str "soul",Space,Str "loves."]]]]

As a "test" (_for what it's worth_): I exported to HTML from the word processor (both Word and LibreOffice) and the HTML is rendered correctly.
(_I could look inside the docx, since it is made of XML, but I don't know how to read it_)

So I think the problem is in reading docx from pandoc.

> So I think the problem is in reading docx from pandoc.

ah yes, you're right (I didn't read your initial post carefully enough).

After a quick look through the XML, those runs that aren't rendered as bold but interpreted by Pandoc as such, have bCs property, which means "complex script bold". Pandoc currently lacks the ability to figure out which scripts are "complex" and which are not, and hence interprets both as bold unconditionally.

Generally speaking, "complex script" properties should coincide with "normal" properties anyway, so this is usually not an issue. But for some reason, here those are all out of whack: both styles and font sizes are different between "normal" and "complex script". I don't know why, but basically the docx itself is very slightly broken.

The easy way to "fix" that is to remove complex script support from the document (since it only contains Latin characters). Not sure how to do that in LibreOffice, but there are guides on doing that in Word (for instance this)

If anyone has ideas on how to actually detect if the text in the run is "complex script" or not, please tell. I don't have any save for querying the OS API, which isn't something Pandoc should be reasonably expected to do.

> (I could look inside the docx, since it is made of XML, but I don't know how to read it)

https://superuser.com/a/278262/247552 :-)
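For anyone who would rather not read the XML by hand, here is a minimal Python sketch that lists the runs whose direct formatting sets bCs but not b. It assumes the body is stored in word/document.xml (the usual location) and only looks at properties set directly on the run, so anything inherited from a style is not resolved. The helper names (is_on, runs_with_only_bcs, check_docx) are made up for this illustration and are not part of Pandoc.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def is_on(rpr, name):
    """True if the toggle property <w:name> is present in rpr and not switched off."""
    el = rpr.find(f"w:{name}", NS)
    if el is None:
        return False
    # A missing w:val means "on"; "0", "false" and "off" switch the property off.
    return el.get(f"{{{W}}}val", "true") not in ("0", "false", "off")


def runs_with_only_bcs(document_xml):
    """Return the text of runs whose direct properties set bCs but not b."""
    root = ET.fromstring(document_xml)
    hits = []
    for run in root.iter(f"{{{W}}}r"):
        rpr = run.find("w:rPr", NS)
        if rpr is not None and is_on(rpr, "bCs") and not is_on(rpr, "b"):
            hits.append("".join(t.text or "" for t in run.findall("w:t", NS)))
    return hits


def check_docx(path):
    """Open a .docx (a zip archive) and scan its main document part."""
    with zipfile.ZipFile(path) as z:
        return runs_with_only_bcs(z.read("word/document.xml"))
```

Calling check_docx on the attached file should print the text of the suspicious runs, provided the bCs property is set on the runs themselves rather than in a style.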

> After a quick look through the XML, those runs that aren't rendered as bold but interpreted by Pandoc as such, have bCs property, which means "complex script bold". Pandoc currently lacks the ability to figure out which scripts are "complex" and which are not, and hence interprets both as bold unconditionally.
>
> Generally speaking, "complex script" properties should coincide with "normal" properties anyway, so this is usually not an issue. But for some reason, here those are all out of whack: both styles and font sizes are different between "normal" and "complex script". I don't know why, but basically the docx itself is very slightly broken.
>
> The easy way to "fix" that is to remove complex script support from the document (since it only contains Latin characters). Not sure how to do that in LibreOffice, but there are guides on doing that in Word (for instance this)
>
> If anyone has ideas on how to actually detect if the text in the run is "complex script" or not, please tell. I don't have any save for querying the OS API, which isn't something Pandoc should be reasonably expected to do.

I really don't know if this is the reason ...

I tried to remove "complex script" but it didn't work.
(setting english or italian)

I tried to save in .doc, .rtf, .odt and then return to .docx but it didn't work.

Maybe I'm doing something wrong ... but with 15,000 pages to process I'm feeling uncomfortable.

As a quick fix, I've published a patched version of Pandoc that adds a ctl (for "complex text layout") extension to the docx reader. It is enabled by default and can be disabled to make Pandoc ignore complex script markers. I've implemented this on top of docx-styles-negative-inlines for the sake of simplicity. Get it with git clone -b docx-ctl-flag https://github.com/lierdakil/pandoc (or via direct download), build as usual, and use it with pandoc -f docx-ctl .... You can stack modifiers, i.e. pandoc -f docx-ctl+styles ... or pandoc -f docx+styles-ctl ... will work.

For a proper fix, I guess we could leverage libicu by getting the character's script and then determining whether that script is complex via a lookup table. The only issue is that, besides adding an external dependency (i.e. libicu), the Haskell bindings (text-icu) are apparently unmaintained and don't export the API required to query the character script. It could be possible to fake this based on character block, but there would be slight inconsistencies, so it's probably not a great idea generally speaking. I could slap together the required bindings, but I'm not sure the effort'd be worth it.

What is the meaning of "complex script" in docx?

I don't want to bind to text-icu; that would make pandoc much more difficult to install. Maybe we can find a good, 95% reliable heuristic.

I'd rather do something automatic than require an extension.

@jkr may want to comment as well.

@jgm, basically, some scripts (i.e. writing systems) make extensive use of contextual ligatures. Complex text layout is the generic umbrella term for handling those. "Complex script" is the term for scripts that generally require complex text layout support. Hopefully this answers the question somewhat.

Now, why does docx use different tags for complex scripts? Heck if I know. It's docx, it does a lot of weird stuff.

> I'd rather do something automatic than require an extension.

That's a given. The only reason I did add the extension is I couldn't be bothered to bind to icu at the time.

> Maybe we can find a good, 95% reliable heuristic.

Well, see, that's the issue. We already have a 95% reliable heuristic, which is to treat "complex script" tags the same as "regular" tags -- it mostly works, but here we're running into the case where it doesn't.

It's not impossible to write a pure Haskell implementation of isComplexScript :: Char -> Bool, based on data provided by icu, but it's potentially a giant hassle to maintain (basically, it'd need to be checked/updated every time there's a new Unicode standard)
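To make the maintenance concern concrete, here is a rough sketch of what such a predicate could look like, written in Python rather than Haskell for brevity. The table is an assumption made for illustration: it lists only a handful of Unicode blocks and is nowhere near complete, and it is not derived from icu data. Every new Unicode release could require revisiting it.

```python
# Illustrative and deliberately incomplete: a few Unicode blocks whose scripts
# generally need complex text layout. A real table would be much longer.
COMPLEX_BLOCKS = [
    (0x0590, 0x05FF),  # Hebrew
    (0x0600, 0x06FF),  # Arabic
    (0x0700, 0x074F),  # Syriac
    (0x0780, 0x07BF),  # Thaana
    (0x0900, 0x097F),  # Devanagari
    (0x0E00, 0x0E7F),  # Thai
]


def is_complex_script(ch):
    """Guess whether a single character belongs to a complex script, by block."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in COMPLEX_BLOCKS)
```

A block-based lookup like this is cheap, but it classifies by code point range only, so punctuation and digits shared between scripts will be misjudged.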

Could we look at the document's language setting to detect this? Or is that not reliably set in docx? Also: doesn't this depend on the font used and whether kerning is enabled? Just asking questions, because in the docx world things probably aren't as you'd expect them to be :P

Actually, I've based what I said up until now on OOXML spec and what LibreOffice does (because I've had it on hand basically). I've now checked with Word 2019 and it straight up ignores bCs tags in my documents regardless of the script I feed it. As does Google Docs. LibreOffice, however, does its own thing.

Google Docs:
image

Word 2019:
image

LibreOffice:
image

So I'm not sure on the semantics of complex scripts in docx anymore. Perhaps @remy33 who reported #4947 could offer us some insight. Otherwise, more research required.

Okay, I've looked at the spec more closely (ECMA-376 5th edition Part 1). Here's what we have there:

17.3.2.26 rFonts (Run Fonts)
This element specifies the fonts which shall be used to display the text contents of this run. Within a single run,
there can be up to four types of font slot which shall each be allowed to use a unique font:

  • ASCII (i.e., the first 128 Unicode code points)
  • High ANSI
  • Complex Script
  • East Asian

and below:

For each Unicode character in a run, the font slot can be determined using the following two-step methodology:

  1. Use the table below to decide the classification of the content, based on its Unicode code point.
  2. If, after the first step, the character falls into East Asian classification and the value of the hint attribute
    is eastAsia, then the character should use East Asian font slot

    1. Otherwise, if there is <w:cs/> or <w:rtl/> in this run, then the character should use Complex
      Script font slot, regardless of its Unicode code point.

      1. Otherwise, the character is decided using the font slot that is corresponding to the
        classification in the table above.

Once the font slot for the run has been determined using the above steps, the appropriate formatting elements
(either complex script or non-complex script) will affect the content.

There's also a helpful diagram in there:
image

The table in question is a bit of a mess with ample exceptions and special cases, but the interesting part is that it doesn't directly determine whether the character is considered complex script or not. Another interesting point is that w:hint="cs" is ignored according to the spec.

Also, in I.3 WordprocessingML we have:

bCs [...] Specifies the bold property for a complex script run of characters, this is applied
when the “rtl” element is specified on a run. It is forced when the “cs” element is
specified (see the “cs” element later in this table).

So based on all this I think we can check for w:rtl and w:cs and be done with this.
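The proposed rule can be sketched in a few lines of Python. This is only an illustration of the check described above, not Pandoc's actual (Haskell) implementation: the function names are hypothetical, and the sketch looks only at the run's own w:rPr element without resolving style inheritance.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def is_on(rpr, name):
    """True if the toggle property <w:name> is present in rpr and not switched off."""
    el = rpr.find(f"w:{name}", NS)
    if el is None:
        return False
    # A missing w:val means "on"; "0", "false" and "off" switch the property off.
    return el.get(f"{{{W}}}val", "true") not in ("0", "false", "off")


def run_is_bold(rpr):
    """Use bCs when the run is marked w:rtl or w:cs, and b otherwise."""
    if rpr is None:
        return False
    complex_run = is_on(rpr, "rtl") or is_on(rpr, "cs")
    return is_on(rpr, "bCs" if complex_run else "b")
```

Under this rule a Latin run that carries only w:bCs, as in the attached document, is not treated as bold, while a run marked w:rtl or w:cs takes its boldness from w:bCs alone.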
