Pandoc: <strong> tag not closed correctly or invented 😳

Created on 9 Jul 2020 · 14 Comments · Source: jgm/pandoc

I noticed that sometimes the <strong> tag is not closed in the correct place, or that it is inserted where it does not exist. 😳
It may also happen for other tags, but so far I have noticed this behavior only with <strong>.

The problem occurs both with version 2.9.2.1 and with the 2.10 build from @lierdakil's branch (#6511), which handles "negative inline formatting".

The anomalous behavior occurs with both the "-f docx" and the "-f docx+styles" options.

I am attaching a docx where the problem arises.
(The file was saved in LibreOffice but the original, coming from Word, also had the same problem)

_Result_

I. Cantico dei Cantici, di Salomone. 1 Mi baci con i baci della sua bocca! Sì migliore del vino è il tuo amore. Inebrianti sono i tuoi profumi per la fragranza, Aroma che si spande è il tuo nome: per questo le ragazze di te si innamorano. Trascinami con te, corriamo!

<strong> is closed after "fragranza," instead of "bocca!"

The Bride's Dream. 1 On my bed by night I sought him whom my soul loves; I sought him, but found him not. I will rise now and go about the city, in the streets and in the squares; I will seek him whom my soul loves.

A nonexistent <strong> is created on "I will rise [...] loves."

Thank you

Files:
bug_strong_not_closed_or_not_exist.docx

Docx reader

SarahSiani-IT98

Most helpful comment

Okay, I've looked at the spec more closely (ECMA-376 5th edition Part 1). Here's what we have there:

17.3.2.26 rFonts (Run Fonts)
This element specifies the fonts which shall be used to display the text contents of this run. Within a single run,
there can be up to four types of font slot which shall each be allowed to use a unique font:

ASCII (i.e., the first 128 Unicode code points)

High ANSI

Complex Script

East Asian

and below:

For each Unicode character in a run, the font slot can be determined using the following two-step methodology:

Use the table below to decide the classification of the content, based on its Unicode code point.

If, after the first step, the character falls into East Asian classification and the value of the hint attribute
is eastAsia, then the character should use East Asian font slot

Otherwise, if there is <w:cs/> or <w:rtl/> in this run, then the character should use Complex
Script font slot, regardless of its Unicode code point.

Otherwise, the character is decided using the font slot that is corresponding to the
classification in the table above.

Once the font slot for the run has been determined using the above steps, the appropriate formatting elements
(either complex script or non-complex script) will affect the content.

There's also a helpful diagram in there:

The table in question is a bit of a mess, with ample exceptions and special cases, but the interesting part is that it doesn't directly determine whether the character is considered complex script or not. Another interesting point is that w:hint="cs" is ignored according to the spec.
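
The selection rules quoted above can be sketched roughly as follows. This is only an illustration in Python, not Pandoc code: the `FontSlot` enum and the `select_font_slot` function are hypothetical names, and the code-point table from the spec is not reproduced here, so its result is passed in as `classification`.

```python
from enum import Enum


class FontSlot(Enum):
    # Hypothetical names for the four font slots listed in the spec excerpt.
    ASCII = "ascii"
    HIGH_ANSI = "hAnsi"
    COMPLEX_SCRIPT = "cs"
    EAST_ASIAN = "eastAsia"


def select_font_slot(classification, hint=None, has_cs=False, has_rtl=False):
    """Apply the quoted rules to a classification taken from the spec table."""
    # East Asian classification combined with hint="eastAsia" wins outright.
    if classification is FontSlot.EAST_ASIAN and hint == "eastAsia":
        return FontSlot.EAST_ASIAN
    # w:cs or w:rtl on the run forces the complex script slot.
    if has_cs or has_rtl:
        return FontSlot.COMPLEX_SCRIPT
    # Otherwise keep the classification from the table.
    # Note that hint="cs" never changes the outcome.
    return classification
```

For example, `select_font_slot(FontSlot.ASCII, hint="cs")` stays `FontSlot.ASCII`, while `select_font_slot(FontSlot.ASCII, has_rtl=True)` becomes `FontSlot.COMPLEX_SCRIPT`.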

Also, in I.3 WordprocessingML we have:

bCs [...] Specifies the bold property for a complex script run of characters, this is applied
when the “rtl” element is specified on a run. It is forced when the “cs” element is
specified (see the “cs” element later in this table).

So based on all this I think we can check for w:rtl and w:cs and be done with this.
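
A minimal sketch of that check, assuming only direct run formatting matters (style inheritance is ignored) and that toggle properties are switched off by the values "0", "false" or "off". It is written in Python for illustration and is not the Pandoc implementation; `is_on` and `effective_bold` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def is_on(rpr, name):
    """True if toggle property `name` is present in rPr and not switched off."""
    el = rpr.find(f"{{{W}}}{name}")
    if el is None:
        return False
    val = el.get(f"{{{W}}}val", "true")
    return val not in ("0", "false", "off")


def effective_bold(run):
    """Use w:bCs for runs marked w:rtl or w:cs, and w:b otherwise."""
    rpr = run.find(f"{{{W}}}rPr")
    if rpr is None:
        return False
    complex_run = is_on(rpr, "rtl") or is_on(rpr, "cs")
    return is_on(rpr, "bCs" if complex_run else "b")
```

With this rule, a Latin run that carries only `<w:bCs/>` is not bold, which matches how Word renders the attached document.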

lierdakil on 13 Jul 2020

All 14 comments

When converting to HTML? Sounds like a bug in the HTML writer then... can you do -t native and find the troublesome part of the document and post it here?

(https://pandoc.org/MANUAL.html#description for background on native format)

mb21 on 9 Jul 2020

> When converting to HTML? Sounds like a bug in the HTML writer then... can you do -t native and find the troublesome part of the document and post it here?
>
> (https://pandoc.org/MANUAL.html#description for background on native format)

I don't think so, because the problem also occurs with markdown.

I'll try -t native now ...

SarahSiani-IT98 on 9 Jul 2020

-f docx -t markdown

***[I. Cantico dei Cantici, di Salomone.]{.ul}*** **[1]{.ul}** Mi **baci
con i baci della sua bocca! Sì migliore del vino è il tuo amore.
Inebrianti sono i tuoi profumi per la fragranza,** Aroma che si spande è
il tuo nome: per questo le ragazze di te si innamorano. Trascinami con
te, corriamo!

***[The Bride\'s Dream.]{.ul}*** **[1]{.ul}** On my bed by night **I
sought him whom my soul loves;** I sought him, but found him not. **I
will rise now and go about the city, in the streets and in the squares;
I will seek him whom my soul loves.**

-f docx -t native

[Para [Emph [Strong [Underline [Str "I.",Space,Str "Cantico",Space,Str "dei",Space,Str "Cantici,",Space,Str "di",Space,Str "Salomone."]]],Space,Strong [Underline [Str "1"]],Space,Str "Mi",Space,Strong [Str "baci",Space,Str "con",Space,Str "i",Space,Str "baci",Space,Str "della",Space,Str "sua",Space,Str "bocca!",Space,Str "S\236",Space,Str "migliore",Space,Str "del",Space,Str "vino",Space,Str "\232",Space,Str "il",Space,Str "tuo",Space,Str "amore.",Space,Str "Inebrianti",Space,Str "sono",Space,Str "i",Space,Str "tuoi",Space,Str "profumi",Space,Str "per",Space,Str "la",Space,Str "fragranza,"],Space,Str "Aroma",Space,Str "che",Space,Str "si",Space,Str "spande",Space,Str "\232",Space,Str "il",Space,Str "tuo",Space,Str "nome:",Space,Str "per",Space,Str "questo",Space,Str "le",Space,Str "ragazze",Space,Str "di",Space,Str "te",Space,Str "si",Space,Str "innamorano.",Space,Str "Trascinami",Space,Str "con",Space,Str "te,",Space,Str "corriamo!"]
,Para [Emph [Strong [Underline [Str "The",Space,Str "Bride's",Space,Str "Dream."]]],Space,Strong [Underline [Str "1"]],Space,Str "On",Space,Str "my",Space,Str "bed",Space,Str "by",Space,Str "night",Space,Strong [Str "I",Space,Str "sought",Space,Str "him",Space,Str "whom",Space,Str "my",Space,Str "soul",Space,Str "loves;"],Space,Str "I",Space,Str "sought",Space,Str "him,",Space,Str "but",Space,Str "found",Space,Str "him",Space,Str "not.",Space,Strong [Str "I",Space,Str "will",Space,Str "rise",Space,Str "now",Space,Str "and",Space,Str "go",Space,Str "about",Space,Str "the",Space,Str "city,",Space,Str "in",Space,Str "the",Space,Str "streets",Space,Str "and",Space,Str "in",Space,Str "the",Space,Str "squares;",Space,Str "I",Space,Str "will",Space,Str "seek",Space,Str "him",Space,Str "whom",Space,Str "my",Space,Str "soul",Space,Str "loves."]]]

-f docx+styles -t native

[Div ("",[],[("custom-style","testo commento")])
 [Para [Span ("",[],[("custom-style","testo tit")]) [Str "I.",Space,Str "Cantico",Space,Str "dei",Space,Str "Cantici,",Space,Str "di",Space,Str "Salomone."],Space,Span ("",[],[("custom-style","numero_glossa")]) [Str "1"],Space,Str "Mi",Space,Strong [Str "baci",Space,Str "con",Space,Str "i",Space,Str "baci",Space,Str "della",Space,Str "sua",Space,Str "bocca!",Space,Str "S\236",Space,Str "migliore",Space,Str "del",Space,Str "vino",Space,Str "\232",Space,Str "il",Space,Str "tuo",Space,Str "amore.",Space,Str "Inebrianti",Space,Str "sono",Space,Str "i",Space,Str "tuoi",Space,Str "profumi",Space,Str "per",Space,Str "la",Space,Str "fragranza,"],Space,Str "Aroma",Space,Str "che",Space,Str "si",Space,Str "spande",Space,Str "\232",Space,Str "il",Space,Str "tuo",Space,Str "nome:",Space,Str "per",Space,Str "questo",Space,Str "le",Space,Str "ragazze",Space,Str "di",Space,Str "te",Space,Str "si",Space,Str "innamorano.",Space,Str "Trascinami",Space,Str "con",Space,Str "te,",Space,Str "corriamo!"]]
,Div ("",[],[("custom-style","testo commento")])
 [Para [Span ("",[],[("custom-style","testo tit")]) [Str "The",Space,Str "Bride's",Space,Str "Dream."],Space,Span ("",[],[("custom-style","numero_glossa Carattere")]) [Str "1"],Space,Str "On",Space,Str "my",Space,Str "bed",Space,Str "by",Space,Str "night",Space,Strong [Str "I",Space,Str "sought",Space,Str "him",Space,Str "whom",Space,Str "my",Space,Str "soul",Space,Str "loves;"],Space,Str "I",Space,Str "sought",Space,Str "him,",Space,Str "but",Space,Str "found",Space,Str "him",Space,Str "not.",Space,Strong [Str "I",Space,Str "will",Space,Str "rise",Space,Str "now",Space,Str "and",Space,Str "go",Space,Str "about",Space,Str "the",Space,Str "city,",Space,Str "in",Space,Str "the",Space,Str "streets",Space,Str "and",Space,Str "in",Space,Str "the",Space,Str "squares;",Space,Str "I",Space,Str "will",Space,Str "seek",Space,Str "him",Space,Str "whom",Space,Str "my",Space,Str "soul",Space,Str "loves."]]]]

SarahSiani-IT98 on 9 Jul 2020

As a "test" (_for what it's worth_): I exported to HTML from the word processor (both Word and LibreOffice) and the HTML is rendered correctly.
(_I could look inside the docx, since it is made of XML, but I don't know how to read it_)

So I think the problem is in reading docx from pandoc.

SarahSiani-IT98 on 9 Jul 2020

> So I think the problem is in reading docx from pandoc.

ah yes, you're right (I didn't read your initial post carefully enough).

mb21 on 9 Jul 2020

After a quick look through the XML, those runs that aren't rendered as bold but interpreted by Pandoc as such, have bCs property, which means "complex script bold". Pandoc currently lacks the ability to figure out which scripts are "complex" and which are not, and hence interprets both as bold unconditionally.

Generally speaking, "complex script" properties should coincide with "normal" properties anyway, so this is usually not an issue. But for some reason, here those are all out of whack: both styles and font sizes are different between "normal" and "complex script". I don't know why, but basically the docx itself is very slightly broken.

The easy way to "fix" that is to remove complex script support from the document (since it only contains Latin characters). Not sure how to do that in LibreOffice, but there are guides on doing that in Word (for instance this)

If anyone has ideas on how to actually detect if the text in the run is "complex script" or not, please tell. I don't have any save for querying the OS API, which isn't something Pandoc should be reasonably expected to do.
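
To find the affected runs in a document, something like the following could be used. This is a sketch in Python under the assumption that only direct run formatting is inspected (bold inherited from styles is not resolved); `toggle_on` and `bcs_only_runs` are hypothetical names, and the input is the content of word/document.xml.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def toggle_on(rpr, name):
    """True if the toggle property is present and not explicitly switched off."""
    el = rpr.find(f"w:{name}", NS)
    return el is not None and el.get(f"{{{W}}}val", "true") not in ("0", "false", "off")


def bcs_only_runs(document_xml):
    """Return the text of runs whose direct formatting sets w:bCs but not w:b."""
    root = ET.fromstring(document_xml)
    hits = []
    for run in root.iter(f"{{{W}}}r"):
        rpr = run.find("w:rPr", NS)
        if rpr is None:
            continue
        if toggle_on(rpr, "bCs") and not toggle_on(rpr, "b"):
            hits.append("".join(t.text or "" for t in run.iter(f"{{{W}}}t")))
    return hits
```

Runs listed by `bcs_only_runs` are the ones that Pandoc reads as bold even though the word processor shows them as regular text.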

lierdakil on 9 Jul 2020

> (I could look inside the docx, since it is made of XML, but I don't know how to read it)

https://superuser.com/a/278262/247552 :-)
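
Since a docx file is a zip archive, the main XML part can also be pulled out with a few lines of Python. A minimal sketch using only the standard library; the function name `read_document_xml` is made up for this example.

```python
import zipfile


def read_document_xml(docx):
    """Return the raw bytes of word/document.xml from a .docx path or file object."""
    with zipfile.ZipFile(docx) as archive:
        return archive.read("word/document.xml")
```

For the file attached to this issue, that would be `read_document_xml("bug_strong_not_closed_or_not_exist.docx")`.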

mb21 on 9 Jul 2020

> After a quick look through the XML, those runs that aren't rendered as bold but interpreted by Pandoc as such, have bCs property, which means "complex script bold". Pandoc currently lacks the ability to figure out which scripts are "complex" and which are not, and hence interprets both as bold unconditionally.
>
> Generally speaking, "complex script" properties should coincide with "normal" properties anyway, so this is usually not an issue. But for some reason, here those are all out of whack: both styles and font sizes are different between "normal" and "complex script". I don't know why, but basically the docx itself is very slightly broken.
>
> The easy way to "fix" that is to remove complex script support from the document (since it only contains Latin characters). Not sure how to do that in LibreOffice, but there are guides on doing that in Word (for instance this)
>
> If anyone has ideas on how to actually detect if the text in the run is "complex script" or not, please tell. I don't have any save for querying the OS API, which isn't something Pandoc should be reasonably expected to do.

I really don't know if this is the reason ...

I tried to remove "complex script" support, but it didn't work
(with the language set to either English or Italian).

I tried to save in .doc, .rtf, .odt and then return to .docx but it didn't work.

Maybe I'm doing something wrong ... but with 15,000 pages to process I'm feeling uncomfortable.

SarahSiani-IT98 on 9 Jul 2020

For a quick fix, I've published a patched version of Pandoc that adds a ctl (for "complex text layout") extension to docx. It is enabled by default and can be disabled to make Pandoc ignore complex script markers. I've implemented this on top of docx-styles-negative-inlines for the sake of simplicity. Get it with git clone -b docx-ctl-flag https://github.com/lierdakil/pandoc (or via direct download), build as usual, and use it with pandoc -f docx-ctl .... You can stack modifiers, i.e. pandoc -f docx-ctl+styles ... or pandoc -f docx+styles-ctl ... will work.

For a proper fix, I guess we could leverage libicu by getting the character's script and then determining whether the script is complex via a lookup table. The only issue is that, besides the additional external dependency (i.e. libicu), the Haskell bindings (text-icu) are apparently unmaintained and don't export the API required to query the character script. It could be possible to fake this based on the character block, but there will be slight inconsistencies, so it's probably not a great idea generally speaking. I could slap together the required bindings, but I'm not sure the effort would be worth it.

lierdakil on 9 Jul 2020

What is the meaning of "complex script" in docx?

I don't want to bind to text-icu; that would make pandoc much more difficult to install. Maybe we can find a good, 95% reliable heuristic.

I'd rather do something automatic than require an extension.

@jkr may want to comment as well.

jgm on 13 Jul 2020

@jgm, basically, some scripts (i.e. writing systems) make extensive use of contextual ligatures. Complex text layout is the generic umbrella term for handling those. "Complex script" is the term for scripts that generally require complex text layout support. Hopefully this answers the question somewhat.

Now, why does docx use different tags for complex scripts? Heck if I know. It's docx, it does a lot of weird stuff.

> I'd rather do something automatic than require an extension.

That's a given. The only reason I did add the extension is I couldn't be bothered to bind to icu at the time.

> Maybe we can find a good, 95% reliable heuristic.

Well, see, that's the issue. We already have a 95% reliable heuristic, which is to treat "complex script" tags the same as "regular" tags -- it mostly works, but here we're running into the case where it doesn't.

It's not impossible to write a pure Haskell implementation of isComplexScript :: Char -> Bool, based on data provided by icu, but it's potentially a giant hassle to maintain (basically, it'd need to be checked/updated every time there's a new Unicode standard)
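
To illustrate the shape of such a lookup (in Python rather than Haskell, and only as a sketch): the block ranges below are a small, illustrative subset chosen for this example, not a complete or authoritative list of complex scripts, and a block-based check has the inconsistencies mentioned earlier. `COMPLEX_BLOCKS`, `is_complex_script` and `run_is_complex` are hypothetical names.

```python
# Illustrative subset of Unicode blocks; NOT a complete or authoritative list.
COMPLEX_BLOCKS = [
    (0x0590, 0x05FF),  # Hebrew
    (0x0600, 0x06FF),  # Arabic
    (0x0700, 0x074F),  # Syriac
    (0x0780, 0x07BF),  # Thaana
    (0x0900, 0x097F),  # Devanagari
    (0x0E00, 0x0E7F),  # Thai
]


def is_complex_script(ch):
    """True if the character falls into one of the listed blocks."""
    code = ord(ch)
    return any(lo <= code <= hi for lo, hi in COMPLEX_BLOCKS)


def run_is_complex(text):
    """Treat a run as complex script if any of its characters is."""
    return any(is_complex_script(ch) for ch in text)
```

A table like this would have to be regenerated from the Unicode data files for each new Unicode version, which is the maintenance burden described above.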

lierdakil on 13 Jul 2020

Could we look at the document's language setting to detect this? Or is that not reliably set in docx? Also: doesn't this depend on the font used and whether kerning is enabled? Just asking questions, because in the docx world things probably aren't as you'd expect them to be :P

mb21 on 13 Jul 2020

Actually, I've based what I said up until now on OOXML spec and what LibreOffice does (because I've had it on hand basically). I've now checked with Word 2019 and it straight up ignores bCs tags in my documents regardless of the script I feed it. As does Google Docs. LibreOffice, however, does its own thing.

[Screenshots: Google Docs, Word 2019, LibreOffice]

So I'm not sure on the semantics of complex scripts in docx anymore. Perhaps @remy33 who reported #4947 could offer us some insight. Otherwise, more research required.

lierdakil on 13 Jul 2020
