Pandoc: HTML Writer: Output underline as <u> instead of <span class="underline">

Created on 16 Oct 2019 · 14 Comments · Source: jgm/pandoc

Opening an issue per https://github.com/jgm/pandoc/pull/5805#issuecomment-542306918. As of that PR, a keyboard element is now represented as a Span with a kbd class, and the HTML Writer outputs using the native <kbd> element.

Pandoc partially implements Underline as a Span with an underline class. Currently, the HTML Writer outputs <span class="underline">. Shouldn’t it output the native <u> element instead?

What are some advantages of outputting native HTML elements?

  • Semantics: Native HTML elements have defined and well-known semantics.
  • Idempotence: Pandoc can convert more HTML to HTML in a lossless round trip.
  • Simpler templates: Currently, templates must give special treatment to classes by adding rules in internal or external stylesheets (see default HTML template, EPUB stylesheet). Moreover, those styles are only included for standalone output. Custom templates must be updated more frequently, depending on upstream changes in Pandoc.
  • Still works outside of standalone output.
  • Still works when CSS is disabled.
  • Interoperability: Documents with native HTML elements are more likely to be preserved in other applications. For example, consider rich text copying/pasting: Chrome copies inline styles (and even inlines applicable internal/external styles), while Firefox doesn’t. While both browsers copy class attributes, it doesn’t matter because the destination may not define the same styles.

Previously:

  • <kbd>: #5796 #5805
  • <samp>, <abbr>, <dfn>, <mark>, <var>: #5792 #5793 #5795 #5797 #5799. Most of these elements can presumably be implemented easily by adding to the map added in #5805.
  • Underline: #2270 #2264 (#4633 #5044 #5135)
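
The class-to-element map mentioned above can be sketched roughly as follows. This is a Python illustration only, not Pandoc's actual Haskell code from #5805; the names `NATIVE_ELEMENTS` and `render_span` are hypothetical, and the `underline` entry is the change this issue proposes rather than existing behaviour.

```python
from html import escape

# Hypothetical map from a Span class to a native HTML element.
# "kbd" reflects the behaviour described for #5805; "underline" is the
# mapping proposed in this issue.
NATIVE_ELEMENTS = {
    "kbd": "kbd",
    "underline": "u",
}


def render_span(classes, text):
    """Render a span with the given classes as an HTML string.

    If one of the classes has a native element, use that element and
    drop the class; otherwise fall back to <span class="...">.
    """
    for cls in classes:
        if cls in NATIVE_ELEMENTS:
            tag = NATIVE_ELEMENTS[cls]
            rest = [c for c in classes if c != cls]
            attr = f' class="{escape(" ".join(rest))}"' if rest else ""
            return f"<{tag}{attr}>{escape(text)}</{tag}>"
    attr = f' class="{escape(" ".join(classes))}"' if classes else ""
    return f"<span{attr}>{escape(text)}</span>"
```

Under this sketch, supporting another element such as `<samp>` would amount to adding one more entry to the map.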

All 14 comments

I don't see much of a downside. I'd say let's do it, unless someone sees a problem with this?

Well... from https://developer.mozilla.org/en-US/docs/Web/HTML/Element/u

This element used to be called the "Underline" element in older versions of HTML, and is still sometimes misused in this way. To underline text, you should instead apply a style that includes the CSS text-decoration property set to underline.

see also further down: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/u#Usage_notes

Thanks for linking to that; I didn't know. Well, maybe not such a good idea then?

I'm not sure that documentation should be a blocker here. The thing is Pandoc is converting from other formats many of which have even less semantic markup available. Sure in some cases it won't be an ideal use of the spec, but it will actually enable better usage for those that care. Who are we to block this just because it can also enable incorrect usage?

it will actually enable better usage for those that care.

I cannot think of any "better usage" that we get from outputting <u> instead of <span class="underline">?

Who are we to block this just because it can also enable incorrect usage?

In HTML5, pandoc outputting <u> when the user intended to output something underlined, would actually be incorrect usage.

Let me quote some more from MDN:

the original HTML Underline (<u>) element was deprecated in HTML 4; however, <u> was restored in HTML 5 with a new, semantic, meaning: to mark text as having some form of non-textual annotation applied.

Valid use cases for the <u> element include annotating spelling errors, applying a proper name mark to denote proper names in Chinese text, and other forms of annotation.

In most cases, you should use an element other than <u>, such as:

  • <em> to denote stress emphasis

Btw. if you don't believe the MDN "documentation", you can also read it in the binding HTML5 standard.

So I think both <span class="underline"> and <em class="underline"> would be fine HTML5 outputs for the pandoc AST Span ("",["underline"],[]).

I'm not arguing that <u> is correct HTML markup for underline styling. I can read the spec on that too. The issue here is input formats and what was meant by them.

You are assuming that what ends up in the Pandoc AST is only meant to be _purely_ presentational. If we could guarantee that our input data only ever meant to style elements visually then yes, <u> would be incorrect. But what about when the source data actually does intend to indicate a non-textual annotation? In many source documents in quite a few formats, there being no other markup suitable for this, the underline style does get overloaded for this meaning. In fact I would argue that if you leave out the kind of bozos who make typographical soup of their texts by overusing such markup and switch up bold/italic/underline just for the sake of variety, this is the most common usage of underline styling in documents I've seen coming from several formats.

We're quickly getting into the weeds of semantics here... but wouldn't you agree that the vast majority of input documents (I'm thinking of Word files for example) use underline to indicate some kind of emphasis and not an "unarticulated, non-textual annotation, such as labeling the text as being a proper name in Chinese text (a Chinese proper name mark), or labeling the text as being misspelt"? Or could you add a couple more example use-cases you've seen in the wild?

Maybe you have a lot more sophisticated document authors than I've come to know. But I would argue that you could always use a pandoc filter if you know your input document uses underline for that purpose.
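
A filter along those lines might look like the following minimal sketch. It assumes Pandoc's JSON AST representation, in which a span is `{"t": "Span", "c": [[id, classes, key-values], inlines]}`, and it uses only the Python standard library. The helper names (`raw_html`, `is_underline_span`, `rewrite`) are made up for illustration. It replaces every Span carrying the `underline` class with its contents wrapped in raw `<u>`…`</u>` inlines.

```python
import json
import sys


def raw_html(markup):
    """Build a RawInline node holding literal HTML."""
    return {"t": "RawInline", "c": ["html", markup]}


def is_underline_span(node):
    """True for a Span node whose class list contains "underline"."""
    return (
        isinstance(node, dict)
        and node.get("t") == "Span"
        and "underline" in node["c"][0][1]
    )


def rewrite(node):
    """Return a copy of the JSON AST with underline Spans turned into <u>."""
    if isinstance(node, list):
        out = []
        for item in node:
            if is_underline_span(item):
                # Splice <u>, the (rewritten) children, and </u> into the parent.
                out.append(raw_html("<u>"))
                out.extend(rewrite(item["c"][1]))
                out.append(raw_html("</u>"))
            else:
                out.append(rewrite(item))
        return out
    if isinstance(node, dict):
        return {key: rewrite(value) for key, value in node.items()}
    return node


if __name__ == "__main__":
    # Pandoc pipes the document to a filter as JSON on stdin and reads
    # the modified document back from stdout.
    json.dump(rewrite(json.load(sys.stdin)), sys.stdout)
```

Saved under a hypothetical name such as `underline_to_u.py`, it could be run with something like `pandoc input.docx -t html --filter ./underline_to_u.py`. Because it emits raw HTML, the sketch is mainly useful for HTML output; other attributes on the span are discarded.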

Maybe I'm arguing that pandoc (in its default configuration) should be able to handle input documents from "the kind of bozos who make typographical soup of their texts by overusing such markup and switch up bold/italic/underline just for the sake of variety".

Personally I try to keep away from Word and the kind of pond scum markup it tends to produce. I do agree a lot of misuse happens out there in the wild, but I don't think that should mean the rest of us can't have nice things.

Garbage in garbage out. I don't object to Pandoc being able to _handle_ content generated by 'that kind of bozo' and even trying to be smart about it, but I do object to it doing so at the expense of not being able to pass better markup through it without being dumbed-down. If you pass it garbage I don't expect it to magically make everything better. Frankly I don't really care how semantically correct the output is if the input had a bunch of underlines that were just for style. If you're nuts enough to do that you deserve the markup you get. (Okay I'm being a bit dramatic here but bear with me.)

But good content in should mean good content out. Don't limit the ability to output good content just because some people might (are likely to) misuse it.

The two tools that authors should reach for to emphasize and draw attention to things should be italic and bold (leaving aside the <i> vs. <em> and <b> vs. <strong> issue for now). An original source document should virtually never have underlines.

The most common use for underlines in documents I see is to redact things later. An editor might underline bits that need to be re-written, that they are going to ask a question about later, etc. They might underline all instances of some word that is overused to show how often it appears. For all of these uses italic and bold are not the right tools for the job, those should be reserved for the text itself to communicate with. Underlining and the concept of annotations go together. Perhaps there is not semantic information in the document to say what the meaning is, but somewhere the redactor is going to explain what their annotations are about.

So that's it, in my book (and the publishing company I run) underlines _are_ annotations. Whether they are directly linked to a sidebar comment or used to highlight something that will be referenced out of band or whatever, underlining is not for visual effect it is to annotate. Since many input formats don't have highlight or comment or other annotation syntax at all, underline is the most commonly available tool that does not conflict with the kind of markup that the base content might have.

What input format do you use to feed to pandoc then? And what tool/editor to author that?

A few comments here:

  1. Using CSS underlining on a <span> to represent emphasis is already improper markup (in the HTML world); <em> should be used instead. So those documents are going to be bad from an HTML spec perspective regardless; using <u> instead of <span> is a little worse, but just barely.

    The statement that “<u> shouldn't be used to underline” is primarily intended to warn against people doing, e.g., <h1><u>My Title</u></h1> instead of styling the h1 with CSS. I think most of the time, in a Markdown or other content-focused document, underlines can be assumed to be semantic and not presentational, and this isn't a problem. Things get more iffy when converting from more presentational formats like Word. (But things will always be iffy here; there simply isn't the semantic information to do a good job.)

  2. For those not familiar with the HTML5 revamping of old HTML3.2 elements, <b> and <i> are also assigned special semantic (non-presentational) meaning:

    • <b> is used to bring attention to a segment of text without assigning it special importance or emphasis. <b> is the correct element to use for drop-caps, article ledes, and marking keywords in texts. It is semantically distinct from <strong>.

    • <i> is used to “set off” text from its surrounding context for meaningful reasons other than emphasis. <i> is the correct element to use for English ship names, character thoughts, and (normally italicized) phrases from another language. It is semantically distinct from <em>.

    For people working primarily in HTML and Markdown, being able to signify / preserve these semantics round-trip is very useful. Certainly, if one is trying to write a book which makes frequent use of italicized character thoughts, the <i>/<em> distinction is important for correct HTML markup. However, similar to the situation with .underline/<u>, these distinctions make much less sense once you start moving to more presentational formats like Word or even LaTeX. Pandoc currently converts <i> to Emph and <b> to Strong when the input format is HTML.

    For round-tripping purposes it might be worthwhile to wrap all three in a certain sort of span, so that, for example, an Emph inside of a Span with a class of (e.g.) simple would be written as <i> instead of <em>, and an underlined Span inside of a Span with a class of simple would be written as <u>. However, underlined Spans in other contexts could continue being treated as they currently are, in order to keep from adding semantic information which shouldn't be there.

    (Personal note: I use Pandoc with custom filters for converting from Markdown to HTML, and while it is possible to write <i> using inline HTML literals in Markdown, it is cumbersome to write filters for dealing with this and it obviously does not round-trip.)

Summary of above proposal:

<i lang="fr">c'est la vie</i>
<b>keyword</b>
<u>mispelt</u>
<span class="underline">underline</span>

becomes (as Markdown):

[*c'est la vie*]{.simple lang="fr"}
[**keyword**]{.simple}
[[mispelt]{.underline}]{.simple}
[underline]{.underline}
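
One possible reading of this proposal, on the writer side, is sketched below in Python. It is an illustration under assumptions, not Pandoc code: it works on Pandoc-style JSON inline nodes, handles only a few node types, treats a `simple` Span as special only when it wraps exactly one Emph, Strong, or underline Span, and moves the wrapper's remaining attributes onto the native element. The function names are hypothetical.

```python
from html import escape


def attrs(ident, classes, kvs):
    """Format an id, class list, and key-value pairs as HTML attributes."""
    parts = []
    if ident:
        parts.append(f'id="{escape(ident)}"')
    if classes:
        parts.append(f'class="{escape(" ".join(classes))}"')
    parts += [f'{key}="{escape(value)}"' for key, value in kvs]
    return "".join(" " + part for part in parts)


def simple_tag(child):
    """Return (tag, children) if child maps to <i>, <b>, or <u>, else (None, None)."""
    if child["t"] == "Emph":
        return "i", child["c"]
    if child["t"] == "Strong":
        return "b", child["c"]
    if child["t"] == "Span" and child["c"][0][1] == ["underline"]:
        return "u", child["c"][1]
    return None, None


def render_all(inlines):
    return "".join(render(node) for node in inlines)


def render(node):
    """Render one inline node to HTML following the `.simple` proposal."""
    kind, content = node["t"], node.get("c")
    if kind == "Str":
        return escape(content)
    if kind == "Space":
        return " "
    if kind in ("Emph", "Strong"):
        tag = "em" if kind == "Emph" else "strong"
        return f"<{tag}>{render_all(content)}</{tag}>"
    if kind == "Span":
        (ident, classes, kvs), children = content
        if "simple" in classes and len(children) == 1:
            tag, inner = simple_tag(children[0])
            if tag:
                rest = [c for c in classes if c != "simple"]
                return f"<{tag}{attrs(ident, rest, kvs)}>{render_all(inner)}</{tag}>"
        # Any other Span, including a bare underline Span, stays a <span>.
        return f"<span{attrs(ident, classes, kvs)}>{render_all(children)}</span>"
    raise ValueError(f"unsupported inline type: {kind}")
```

With this reading, an underline Span outside a `simple` wrapper keeps its `<span class="underline">` output, which matches the last line of the summary.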

Using CSS underlining on a <span> to represent emphasis is already improper markup (in the HTML world); <em> should be used instead.

Good point. So the pandoc AST Underline [Str "foo"] should be rendered to HTML as the following, then?

<em class="underline">foo</em>

Personally I think we should just use <u>. We already have an Emph element for emphasis.

The spec says

to mark text as having some form of non-textual annotation applied. Valid use cases for the <u> element include annotating spelling errors, applying a proper name mark to denote proper names in Chinese text, and other forms of annotation.

This seems fairly general and may encompass most of the uses for which people need underlining instead of just generic emphasis.

The way I see it, the main arguments are:

  • in favour of <u>: nice to have a dedicated element that's analogous to the pandoc AST element (for round-tripping etc.) and browsers' default styling is underline (due to legacy reasons)
  • against: we're not outputting <b> and <i> either, so why <u>?

Because in order to know what "the right thing" to output is, we would need to know what the pandoc user intended with "underline". Probably there are people that really use it as an annotation (but then why not <mark>?) and there are people that (unfortunately) just use it as a third level of emphasis. And we won't get it correct in 100% of the cases either way.

But yes, I'm fine with giving <u> a try and hearing feedback from pandoc users on how they're using it.

Since #6277 was merged, both Pandoc and the HTML Writer now finally natively support Underline (very happy, thank you!), so this issue seems to be fixed and is ready to be closed. For reference, the relevant changed line in the HTML Writer was:
https://github.com/jgm/pandoc/blob/c815d2f2284cb1d9f07cad76ee877ba7a928ad6b/src/Text/Pandoc/Writers/HTML.hs#L1066

I may file a new issue about inconsistencies in Underline output among Markdown flavors since 2.10.
