Pandoc: Using <i> and <b> in HTML5

Created on 24 Jan 2018 · 12 Comments · Source: jgm/pandoc

The generic use of <em> and <strong> for italics and bold technically runs against the HTML5 specification. Under the description of <em>, it states:

The <em> element isn’t a generic "italics" element. Sometimes, text is intended to stand out from the rest of the paragraph, as if it was in a different mood or voice. For this, the <i> element is more appropriate.

The same is true for <strong> (though they note that <b> 'should be used as a last resort'). For further discussion, see the W3C's 'Using <b> and <i> elements' page.

I can think of a few possible ways of mitigating this, and there are probably other approaches:

  1. Leave things as-is, under the presumption that eighty per cent of uses probably conform to the intention of <em> and <strong>, but try to detect other cases. For example, titles within citations could be converted to HTML using <cite>; italicized words with a lang attribute set to something other than the main language could be converted to <i class="foreignphrase">. (I am thinking of the <foreign> element in TEI.)
  2. Allow the user to add classes prefixed with something like i: or b: to change a specific use to plain italics/bold and apply that class, perhaps in combination with the above.
  3. Use <i> and <b> for everything, or provide an option to control this if there is concern over breakage.
HTML

All 12 comments

From http://spec.commonmark.org/0.28/#emphasis-and-strong-emphasis

John Gruber’s original Markdown syntax description says:

Markdown treats asterisks (*) and underscores (_) as indicators of emphasis. Text wrapped with one * or _ will be wrapped with an HTML <em> tag; double *’s or _’s will be wrapped with an HTML <strong> tag.

Those are heavy precedents; therefore, I think option 1 is most appropriate.

I'm not really sure what the worry is here -- perhaps you could give a specific example?
If you're targeting HTML you can always use spans with attributes and fully control their formatting using CSS. So, e.g. a [mot]{lang=fr} in French.

The original Markdown description was written with only <em> and <strong> in mind because the use of <i> and <b> was heavily discouraged by the W3C at that time. HTML5 gave them a new 'semantic' meaning, and now states that it's incorrect to use <em> for something other than emphasis. It would be nice to avoid this where it can be automated.

If I were to use Markdown to indicate a foreign word, I would need to write:

These institutions are known as *[caisses]{lang=fr}*.

Right now, this becomes:

These institutions are known as <em><span lang="fr">caisses</span></em>.

But according to the W3C guidance, it should be rendered using <i> plus ideally some class, like this:

These institutions are known as <i class="foreignphrase" lang="fr">caisses</i>.

I should add, similarly, that one could write e.g. [*Pride and Prejudice*]{.cite} to receive <cite>Pride and Prejudice</cite>, [*n*]{.var} for the variable <var>n</var>, and so forth. Right now, one can't just write <cite>Pride and Prejudice</cite> in Markdown when these tags are needed, because they won't be recognized and rendered as italics when exporting to non-HTML formats.
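The caisses example could be handled mechanically once the document is in pandoc's JSON AST. The following is only a sketch of that idea in Python, not pandoc's behaviour: the function name `foreign_italics` is hypothetical, the class name `foreignphrase` is taken from the proposal above, and the code assumes pandoc's usual JSON shapes (`{"t": "Emph", "c": [...]}` and `{"t": "Span", "c": [[id, classes, keyvals], inlines]}`).

```python
import html


def foreign_italics(inline):
    """Rewrite Emph[Span(lang=...)] as raw <i class="foreignphrase" lang="...">.

    Returns a list of inlines so the result can be spliced into the parent.
    Assumption: any id, classes, or other attributes on the span are dropped.
    """
    if inline.get("t") != "Emph":
        return [inline]
    children = inline["c"]
    # Only handle emphasis that wraps exactly one span.
    if len(children) != 1 or children[0].get("t") != "Span":
        return [inline]
    (ident, classes, keyvals), content = children[0]["c"]
    lang = dict(keyvals).get("lang")
    if lang is None:
        return [inline]
    open_tag = '<i class="foreignphrase" lang="%s">' % html.escape(lang, quote=True)
    return (
        [{"t": "RawInline", "c": ["html", open_tag]}]
        + content
        + [{"t": "RawInline", "c": ["html", "</i>"]}]
    )


# The AST pandoc would produce for *[caisses]{lang=fr}*
emph = {
    "t": "Emph",
    "c": [{"t": "Span", "c": [["", [], [["lang", "fr"]]], [{"t": "Str", "c": "caisses"}]]}],
}
result = foreign_italics(emph)
```

Emphasis that does not wrap a single span with a lang attribute is returned unchanged, so ordinary *emphasis* would still become <em>. Because the result is raw HTML, a real filter would apply this only when the target format is HTML.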

+++ Andrew Dunning [Jan 24 18 09:48 ]:

These institutions are known as [caisses]{lang=fr}.

Right now, this becomes:

These institutions are known as <span lang="fr">caisses</span>.

But according to the W3C guidance, it should be rendered using <i> plus ideally some class, like this:

These institutions are known as <i class="foreignphrase" lang="fr">caisses</i>.

We could easily change the HTML writer to produce an <i class="foreignphrase"> instead of a <span> for a Span element with a lang attribute.

The only reservation I'd have about this would be breaking things for people who are relying on the old behavior.

@adunning one approach would be to use a simple lua filter that turns

Span ("",["cite"],[]) ils

into

[RawInline (Format "html") "<cite>"] ++ ils ++ [RawInline (Format "html") "</cite>"]

if the output format is html5, and

Emph ils

otherwise. This would be a trivial filter to write.
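As a rough illustration of the rule just described, here is a sketch written against pandoc's JSON AST in Python rather than as the Lua filter the comment has in mind. The function name `rewrite_cite_span` and the `target_format` parameter are hypothetical, and treating both "html" and "html5" as HTML5 output is an assumption. Unlike the exact pattern `Span ("",["cite"],[])`, it matches any span that carries the class `cite`.

```python
def raw_html(text):
    """Build a RawInline node in pandoc's JSON representation."""
    return {"t": "RawInline", "c": ["html", text]}


def rewrite_cite_span(inline, target_format):
    """Turn a Span with class "cite" into <cite>...</cite> for HTML, else Emph.

    Returns a list of inlines so the caller can splice it into the parent.
    """
    if inline.get("t") != "Span":
        return [inline]
    (ident, classes, keyvals), content = inline["c"]
    if "cite" not in classes:
        return [inline]
    # Assumption: these are the format names that mean HTML5 output.
    if target_format in ("html", "html5"):
        return [raw_html("<cite>")] + content + [raw_html("</cite>")]
    return [{"t": "Emph", "c": content}]


# The AST pandoc would produce for [Emma]{.cite}
span = {"t": "Span", "c": [["", ["cite"], []], [{"t": "Str", "c": "Emma"}]]}
as_html = rewrite_cite_span(span, "html5")
as_other = rewrite_cite_span(span, "latex")
```

Here `as_html` wraps the title in raw <cite> tags, while `as_other` falls back to plain emphasis so the title is still italicized in non-HTML formats.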

Generally, I'm a strong believer in semantic HTML. But the use of both <em> and <i> in the same document often leads to more trouble than it's worth. When to use which is hard for a human – and near-impossible for a markdown-engine like pandoc – to figure out. For your custom use-case where you know what you're doing you can always use a lua-filter as mentioned above.

Thus I'm closing this issue. @jgm of course, feel free to reopen if you feel otherwise.

I was about to open this same issue but I'm glad I searched through the old ones first and found this thread about it.

Please consider reopening it.

I think one star should be <i> and two should be <b>. It's not wrong to use <i> for emphasis even though <em> is better. It is wrong to use <em> for things other than emphasis. Which I currently do on the daily since I use Pandoc so much, including to export to a site (they run Vanilla Forums) that doesn't even recognize <cite>; they do recognize <i> though, so that would be less wrong than <em>.

I'm sure Gruber could be brought on board with this position. I'd also ask him to add the unicode glyph "•" as one of the syntax options for unordered lists. Markdown is so great.

I would appreciate a command-line switch to output b and i instead of strong and em in HTML (i.e. option 3 above), as I primarily convert Word documents in which bold represents distinctive information rather than emphasis, and italics represents titles rather than emphasis 99% of the time.

Yeah, I don't want both, I just want b and i all the time instead of strong and em. The former is never wrong. The latter is sometimes _more_ right (more specific, more semantic, more better) but more often just outright wrong.

You can change the behavior rather easily with a small lua script (~6 lines).
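The Lua script itself is not shown in the thread. As a hedged sketch of the same transformation, here is one way to express it over pandoc's JSON AST in Python (a swapped-in approach, not the commenter's script). The names `TAGS`, `raw_html`, and `presentational` are hypothetical.

```python
TAGS = {"Emph": "i", "Strong": "b"}


def raw_html(text):
    """Build a RawInline node in pandoc's JSON representation."""
    return {"t": "RawInline", "c": ["html", text]}


def is_target(node):
    """True for Emph and Strong nodes."""
    return isinstance(node, dict) and isinstance(node.get("t"), str) and node["t"] in TAGS


def presentational(node):
    """Recursively replace Emph/Strong with raw <i>/<b> wrappers."""
    if isinstance(node, list):
        out = []
        for item in node:
            result = presentational(item)
            if is_target(item):
                out.extend(result)  # splice the wrapper and its contents in place
            else:
                out.append(result)
        return out
    if is_target(node):
        tag = TAGS[node["t"]]
        return [raw_html("<%s>" % tag)] + presentational(node["c"]) + [raw_html("</%s>" % tag)]
    if isinstance(node, dict):
        return {key: presentational(value) for key, value in node.items()}
    return node


# The AST for a paragraph containing ***hi***
para = {"t": "Para", "c": [{"t": "Strong", "c": [{"t": "Emph", "c": [{"t": "Str", "c": "hi"}]}]}]}
converted = presentational(para)
```

Since the output is raw HTML, this should only be applied when the target format is HTML; other writers would drop the raw inlines and lose the italics and bold entirely.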

You might consider bringing up this issue on pandoc-discuss, rather than a closed issue where nobody will see it.
