Pandoc: Using <i> and <b> in HTML5

Created on 24 Jan 2018 · 12 Comments · Source: jgm/pandoc

The generic use of <em> and <strong> for italics and bold technically runs against the HTML5 specification. Under the description of <em>, it states:

The <em> element isn’t a generic "italics" element. Sometimes, text is intended to stand out from the rest of the paragraph, as if it was in a different mood or voice. For this, the <i> element is more appropriate.

The same is true for <strong> (though they note that <b> 'should be used as a last resort'). For further discussion, see the W3C's 'Using <b> and <i> elements' page.

I can think of a few possible ways of mitigating this, and there are probably other approaches:

1. Leave things as-is, under the presumption that eighty per cent of uses probably conform to the intention of <em> and <strong>, but try to detect other cases. For example, titles within citations could be converted to HTML using <cite>; italicized words with a lang attribute set to something other than the main language could be converted to <i>. (I am thinking of the <foreign> element in TEI.)
2. Allow the user to add classes prefixed with something like i: or b: to change a specific use to plain italics/bold and apply that class, perhaps in combination with the above.
3. Use <i> and <b> for everything, or provide an option to control this if there is concern over breakage.

adunning

👍2

All 12 comments

From http://spec.commonmark.org/0.28/#emphasis-and-strong-emphasis

John Gruber’s original Markdown syntax description says:

Markdown treats asterisks (*) and underscores (_) as indicators of emphasis. Text wrapped with one * or _ will be wrapped with an HTML <em> tag; double *’s or _’s will be wrapped with an HTML <strong> tag.

Those are heavy precedents, therefore I think option 1 is most appropriate.

mb21 on 24 Jan 2018

👍1

I'm not really sure what the worry is here -- perhaps you could give a specific example?
If you're targeting HTML you can always use spans with attributes and fully control their formatting using CSS. So, e.g. a [mot]{lang=fr} in French.

jgm on 24 Jan 2018

The original Markdown description was written with only <em> and <strong> in mind because the use of <i> and <b> was heavily discouraged by the W3C at that time. HTML5 gave them a new 'semantic' meaning, and now states that it's incorrect to use <em> for something other than emphasis. It would be nice to avoid this where it can be automated.

If I were to use Markdown to indicate a foreign word, I would need to write:

These institutions are known as *[caisses]{lang=fr}*.

Right now, this becomes:

These institutions are known as <em><span lang="fr">caisses</span></em>.

But according to the W3C guidance, it should be rendered using <i> plus ideally some class, like this:

These institutions are known as <i class="foreignphrase" lang="fr">caisses</i>.
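
As a rough illustration of that conversion, here is a sketch of a pandoc JSON filter in Python, using only the standard library. It is not something pandoc does itself. It assumes the usual JSON AST shapes for Emph, Span, and RawInline, and it only matches an Emph whose sole child is a Span carrying a lang attribute. The names `raw_html`, `foreign_phrase`, `walk`, and `main` are invented for this example.

```python
import html
import json
import sys


def raw_html(text):
    """Build a RawInline node that carries literal HTML."""
    return {"t": "RawInline", "c": ["html", text]}


def foreign_phrase(node):
    """Return replacement inlines for Emph[Span{lang=...}], or None to keep the node."""
    if node.get("t") != "Emph" or len(node["c"]) != 1:
        return None
    inner = node["c"][0]
    if inner.get("t") != "Span":
        return None
    (_ident, _classes, keyvals), inlines = inner["c"]
    lang = dict(keyvals).get("lang")
    if lang is None:
        return None
    # The class name "foreignphrase" is taken from the example in this thread.
    opening = '<i class="foreignphrase" lang="%s">' % html.escape(lang, quote=True)
    return [raw_html(opening)] + inlines + [raw_html("</i>")]


def walk(value, rewrite):
    """Rebuild a JSON tree, splicing in rewrite(node) wherever it returns a list."""
    if isinstance(value, list):
        out = []
        for item in value:
            replacement = rewrite(item) if isinstance(item, dict) else None
            if replacement is None:
                out.append(walk(item, rewrite))
            else:
                out.extend(walk(replacement, rewrite))
        return out
    if isinstance(value, dict):
        return {key: walk(val, rewrite) for key, val in value.items()}
    return value


def main():
    """Filter entry point: pandoc passes the target format as the first argument."""
    target = sys.argv[1] if len(sys.argv) > 1 else ""
    doc = json.load(sys.stdin)
    if target.startswith("html"):
        doc = walk(doc, foreign_phrase)
    json.dump(doc, sys.stdout)

# Add a call to main() here when running the script as a real filter.
```

With a `main()` call appended and the script saved under a hypothetical name such as `foreign.py`, it could be passed to pandoc via `--filter`. Other output formats are left untouched, so the emphasis still renders as italics there.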

adunning on 24 Jan 2018

👍1

I should add, similarly, that one could write e.g. [*Pride and Prejudice*]{.cite} to receive <cite>Pride and Prejudice</cite>; the variable [*n*]{.var} for the variable <var>n</var>, and so forth. Right now, one can't just write <cite>Pride and Prejudice</cite> in Markdown if one is in a situation of needing these tags, because they then won't be recognized and rendered as italics in exporting to non-HTML formats.

adunning on 24 Jan 2018

+++ Andrew Dunning [Jan 24 18 09:48 ]:

These institutions are known as *[caisses]{lang=fr}*.

Right now, this becomes:

These institutions are known as <em><span lang="fr">caisses</span></em>.

But according to the W3C guidance, it should be rendered using <i> plus ideally some class, like this:

These institutions are known as <i class="foreignphrase" lang="fr">caisses</i>.

We could easily change the HTML writer to produce an <i class="foreignphrase"> instead of a <span> for a Span element with a lang attribute.

The only reservation I'd have about this would be breaking
things for people who are relying on the old behavior.

jgm on 24 Jan 2018

@adunning one approach would be to use a simple lua filter that turns

Span ("",["cite"],[]) ils

into

[RawInline (Format "html") "<cite>"] ++ ils ++ [RawInline (Format "html") "</cite>"]

if the output format is html5, and

Emph ils

otherwise. This would be a trivial filter to write.
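
The suggestion is for a Lua filter. The sketch below expresses the same idea as a pandoc JSON filter in Python (standard library only), so treat it as an illustration, not the filter that was meant. It assumes the standard JSON AST shapes, matches any Span that has the class cite (slightly looser than the exact pattern above), and accepts both "html" and "html5" as the target name. The helper names are invented.

```python
import json
import sys


def raw_html(text):
    """Build a RawInline node that carries literal HTML."""
    return {"t": "RawInline", "c": ["html", text]}


def make_cite_rewriter(target):
    """Return a rewrite function specialised for the given output format."""
    html_output = target in ("html", "html5")

    def rewrite(node):
        if node.get("t") != "Span":
            return None
        (_ident, classes, _keyvals), inlines = node["c"]
        if "cite" not in classes:
            return None
        if html_output:
            # HTML output: wrap the contents in a literal <cite> element.
            return [raw_html("<cite>")] + inlines + [raw_html("</cite>")]
        # Any other format: fall back to ordinary emphasis.
        return [{"t": "Emph", "c": inlines}]

    return rewrite


def walk(value, rewrite):
    """Rebuild a JSON tree, splicing in rewrite(node) wherever it returns a list."""
    if isinstance(value, list):
        out = []
        for item in value:
            replacement = rewrite(item) if isinstance(item, dict) else None
            if replacement is None:
                out.append(walk(item, rewrite))
            else:
                out.extend(walk(replacement, rewrite))
        return out
    if isinstance(value, dict):
        return {key: walk(val, rewrite) for key, val in value.items()}
    return value


def main():
    """Filter entry point: pandoc passes the target format as the first argument."""
    target = sys.argv[1] if len(sys.argv) > 1 else ""
    doc = json.load(sys.stdin)
    json.dump(walk(doc, make_cite_rewriter(target)), sys.stdout)

# Add a call to main() here when running the script as a real filter.
```

This version is written for input like `[Pride and Prejudice]{.cite}` without inner asterisks. If the span already contains emphasis, the non-HTML branch would nest one Emph inside another.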

jgm on 24 Jan 2018

Generally, I'm a strong believer in semantic HTML. But the use of both <em> and <i> in the same document often leads to more trouble than it's worth. When to use which is hard for a human – and near-impossible for a markdown-engine like pandoc – to figure out. For your custom use-case where you know what you're doing you can always use a lua-filter as mentioned above.

Thus I'm closing this issue. @jgm of course, feel free to reopen if you feel otherwise.

mb21 on 11 Mar 2018

😕1

I was about to open this same issue but I'm glad I searched through the old ones first and found this thread about it.

Please consider reopening it.

I think one star should be <i> and two should be <b>. It's not wrong to use <i> for emphasis even though <em> is better. It is wrong to use <em> for things other than emphasis. Which I currently do on the daily since I use Pandoc so much; including to export to a site (they run Vanilla Forums) that doesn't even recognize <cite>; they do recognize <i> though so that would be less wrong than <em>.

I'm sure Gruber could be brought on board with this position. I'd also ask him to add the unicode glyph "•" as one of the syntax options for unordered lists. Markdown is so great.

snan on 1 May 2019

I would appreciate a command-line switch to output b and i instead of strong and em in HTML (i.e. option 3 above), as I primarily convert Word documents in which bold represents distinctive information rather than emphasis, and italics represents titles rather than emphasis 99% of the time.

hftf on 24 Sep 2019

Yeah, I don't want both, I just want b and i all the time instead of strong and em. The former is never wrong. The latter is sometimes _more_ right (more specific, more semantic, more better) but more often just outright wrong.

snan on 24 Sep 2019

You can change the behavior rather easily with a small lua script (~6 lines).
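
The Lua script itself is not shown in the thread. As a hedged stand-in, the same behavior can be sketched as a pandoc JSON filter in Python (standard library only). It assumes the standard JSON AST shapes for Emph, Strong, and RawInline and only rewrites when the target format name starts with "html". The names `TAGS`, `plain_italic_bold`, `walk`, and `main` are invented for this example.

```python
import json
import sys

# Map pandoc inline constructors to the presentational tags requested here.
TAGS = {"Emph": "i", "Strong": "b"}


def raw_html(text):
    """Build a RawInline node that carries literal HTML."""
    return {"t": "RawInline", "c": ["html", text]}


def plain_italic_bold(node):
    """Replace Emph/Strong with raw <i>/<b> wrappers; return None otherwise."""
    tag = TAGS.get(node.get("t"))
    if tag is None:
        return None
    return [raw_html("<%s>" % tag)] + node["c"] + [raw_html("</%s>" % tag)]


def walk(value, rewrite):
    """Rebuild a JSON tree, splicing in rewrite(node) wherever it returns a list."""
    if isinstance(value, list):
        out = []
        for item in value:
            replacement = rewrite(item) if isinstance(item, dict) else None
            if replacement is None:
                out.append(walk(item, rewrite))
            else:
                # Walk the replacement too, so nested Emph/Strong are handled.
                out.extend(walk(replacement, rewrite))
        return out
    if isinstance(value, dict):
        return {key: walk(val, rewrite) for key, val in value.items()}
    return value


def main():
    """Filter entry point: pandoc passes the target format as the first argument."""
    target = sys.argv[1] if len(sys.argv) > 1 else ""
    doc = json.load(sys.stdin)
    if target.startswith("html"):
        doc = walk(doc, plain_italic_bold)
    json.dump(doc, sys.stdout)

# Add a call to main() here when running the script as a real filter.
```

Because the rewrite is skipped for non-HTML targets, other output formats keep their normal emphasis and strong emphasis.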

jgm on 24 Sep 2019

You might consider bringing up this issue on pandoc-discuss, rather than a closed issue where nobody will see it.

jgm on 24 Sep 2019
