Pandoc: parse small caps from HTML as a class

Created on 31 Aug 2014 · 24 Comments · Source: jgm/pandoc

Writing small caps to a class in HTML, textile and markdown has the following benefit.

For HTML there are three ways of getting small caps:

  1. font-variant: small-caps; is the standard way in CSS2.
  2. font-feature-settings: "smcp"; is the way to enable OpenType features in CSS3.
  3. font-family: MySmallCapsFont; is the way to get real small caps instead of the fake ones from the first option. Used in many ePub books.

Consider the following: if small caps is a class, users can pick the best method for the fonts they have available. If the style is hardcoded, they are stuck with fake small caps.

BTW, small caps are lost when converting to textile (sample).

Many thanks for your help.

enhancement

All 24 comments

Could you be more specific about the change you're proposing?
What does pandoc do now (give example), and what would you propose
it do instead (give example)?

+++ Pablo Rodríguez [Aug 31 14 12:17 ]:

> Writing small caps to a class in HTML, textile and markdown has the
> following benefit.
>
> For HTML there are three ways of getting small caps:
>
>   1. font-variant: small-caps; is the standard way in CSS2.
>   2. font-feature-settings: "smcp"; is the way to enable OpenType
>     features in CSS3.
>   3. font-family: MySmallCapsFont; is the way to get real small caps
>     instead of the fake ones from the first option. Used in many ePub
>     books.
>
> Consider the following: if small caps is a class, the user would be
> able to adapt the best method with the available fonts. If small caps
> are hardcoded, there will be no other way than fake small caps.
>
> BTW, small caps are lost when converting to textile ([1]sample).
>
> Many thanks for your help.


References

  1. http://johnmacfarlane.net/pandoc/try/?text=a&from=markdown&to=textile
  2. https://github.com/jgm/pandoc/issues/1592

Sorry for not being more accurate.

From the code sample (common to both proposals):

<span style="font-variant: small-caps;">a</span>

pandoc gives the following output for both HTML and markdown (all versions):

<span style="font-variant: small-caps;">a</span>

I propose:

<span class="smallcaps">a</span>

So you can configure this with CSS, using any of the three options I listed in the first message. With pandoc's current output, no such configuration is possible.

And from the code above, pandoc renders it into textile as:


I propose:

<span class="smallcaps">a</span>

I hope it is clearer now. Let me know if it isn’t.

I see. That makes sense, and I think I agree. Note that any change here would need to be coordinated with pandoc-citeproc.

I agree with using a class too. For backward compatibility, maybe a default CSS rule (font-variant: small-caps;) should be included (before any custom CSS) in all HTML-related templates. Because of the cascade, user-defined CSS can override this default behavior.

When deciding on which class to use, I suggest keeping it short. Together with the possibility given by the new bracketed spans (pull request #3191), it would make a very short and convenient native markdown small cap syntax.

I suggest using the short form already used in font-feature-settings: smcp. It is not so short or arbitrary that a class not intended as small caps would accidentally be turned into a native SmallCaps. Combined with bracketed spans, the syntax would then be [Small Caps]{.smcp}.

Looking at the code and studying pandoc's output again: currently the markdown reader pretty much looks for an exact match of <span style="font-variant:small-caps;">Small caps</span> with no id and no class (it does allow other attributes, e.g. foo=bar). And even then:

printf '<span foo=bar style="font-variant:small-caps;">Small caps</span>' | pandoc -t native

will output

[Para [SmallCaps [Str "Small",Space,Str "caps"]]]

i.e. the foo=bar information is lost, because SmallCaps has no notion of attributes.
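The attribute loss can be pictured with a toy model of the two inline shapes. This is a minimal Python sketch, not pandoc's Haskell code: the `Span` and `SmallCaps` classes and the `to_small_caps` helper are hypothetical names that only mirror the native output shown above.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    # Mirrors the shape of an attribute triple: identifier, classes, key-value pairs.
    identifier: str = ""
    classes: list = field(default_factory=list)
    keyvals: dict = field(default_factory=dict)
    content: list = field(default_factory=list)


@dataclass
class SmallCaps:
    # No attribute slot at all: only the inline content can be stored.
    content: list = field(default_factory=list)


def to_small_caps(span: Span) -> SmallCaps:
    """Hypothetical conversion: everything except the content is dropped."""
    return SmallCaps(content=list(span.content))


span = Span(
    keyvals={"foo": "bar", "style": "font-variant:small-caps;"},
    content=["Small", " ", "caps"],
)
result = to_small_caps(span)
print(result)
```

Because the target type has nowhere to put `foo=bar`, the loss is structural rather than a reader bug, which is why the question of an AST change comes up.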

SmallCaps having no attributes at all was fine with the old syntax. But if we now use, say, a class to denote small caps, then it should be able to carry other attributes as well. Does that mean this requires an AST change?

For a moment I considered implementing this feature once we settle on the class name. But seeing that it probably requires an AST change, I am not up to the task.

If a small-caps syntax that does not allow any other attributes is OK, at least for now, then I might be able to contribute. I think only pandoc, pandoc-citeproc and pandoc-templates need to be changed, correct?

Just to confirm the current AST doesn't support SmallCaps with attributes:

$ printf "%s" '[Para [SmallCaps ("",["test"],[]) [Str "testing"]]]' | pandoc -t markdown -f native
pandoc: Could not read: [Para [SmallCaps ("",["test"],[]) [Str "testing"]]]
CallStack (from HasCallStack):
  error, called at src/Text/Pandoc/Error.hs:55:28 in pandoc-1.18-HpgSxG1cAK27F85VcbC1Vb:Text.Pandoc.Error

@jgm said:

> Thanks. I do like @ousia's suggestion to use a "small-caps" class.
> Perhaps the condition could be: if the span (whether HTMLish or bracket) has either the class "small-caps" (and no other attributes) or style="font-variant:small-caps" (and no other attributes), it's treated as a small caps element.
> Although perhaps @ousia is more concerned with HTML output than with how these things are input.
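The condition quoted above can be written down as a small predicate. This is a Python sketch under assumptions, not pandoc's reader code: `is_small_caps_span` and `normalize_style` are hypothetical names, and "no other attributes" is read here as no identifier, no additional classes and no additional key-value pairs.

```python
def normalize_style(style: str) -> str:
    """Drop whitespace and a trailing semicolon so style variants compare equal."""
    return "".join(style.split()).rstrip(";").lower()


def is_small_caps_span(identifier: str, classes: list, keyvals: dict) -> bool:
    """Return True if a span should be treated as a small caps element."""
    # Case 1: exactly the class "small-caps" and nothing else.
    by_class = classes == ["small-caps"] and not identifier and not keyvals
    # Case 2: exactly a style attribute requesting small caps and nothing else.
    by_style = (
        not identifier
        and not classes
        and list(keyvals) == ["style"]
        and normalize_style(keyvals["style"]) == "font-variant:small-caps"
    )
    return by_class or by_style


print(is_small_caps_span("", ["small-caps"], {}))
print(is_small_caps_span("", ["small-caps"], {"foo": "bar"}))
```

Under this reading, a span carrying any extra attribute stays a generic span, which is exactly the behavior the following comments are worried about.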

What I'm afraid of is that SmallCaps carries no attributes. Once it is specified by a small-caps class, people will expect other attributes to carry over. And by the way, I think a shorter class might be better, say, smcp.

Just to reference it: #684. If all elements get attributes, SmallCaps gets attributes too, and this change becomes more natural.

[Sorry for my delayed reply.]

@jgm, my original proposal was a smallcaps class. If there is an element, I think it should have special syntax (otherwise, users won’t have direct access to it).

@ickc, if the smallcaps element is specified by a class, what prevents adding an identifier, other classes and other attributes (such as language)?

The current SmallCaps in the AST is not a class (it's just an element called SmallCaps), so it cannot include attributes.

Alternatively, one could remove the specific SmallCaps element from the AST and use a generic span with a class to handle it. But since there's no standard on which class to use, existing small caps are specified with style instead, and backward compatibility matters, promoting the current SmallCaps to carry attributes seems best.

If there is a SmallCaps element, how does the user invoke it?

I guess that emphasis with a small caps class attribute would make more sense. Otherwise, elements and formats are indistinguishable.

But in that case, emphasis should be granted attributes first.

> If there is a SmallCaps element, how does the user invoke it?

From markdown it's a span with the style attribute:

[foo]{style=font-variant:small-caps}

@mb21, I thought this would allow other attributes (from the way the user invokes it, she only adds a class [that should be able to exist among other attributes]).

My question is: does it really make sense to have a SmallCaps element? Isn’t it a way of mixing format and content? Small caps is how the text should be displayed, not what it is. Author could be an element.

How about _emphasis_{.smallcaps}? From a text encoding perspective, I think it makes more sense.

@ousia sounds reasonable to me, however SmallCaps _is_ already in the AST (unfortunately without Attr) and has been for a couple of years. See https://github.com/jgm/pandoc/search?q=smallcaps&type=Commits

@ousia, if you run pandoc -t native on anything with small caps, you'll see a SmallCaps element there, i.e. as @mb21 said, it has been there for a long time. So the issue is how to move forward while keeping backward compatibility. E.g. right now, if you give more attributes to your markdown small caps, it will not be parsed into a SmallCaps element but into a general HTML span instead, meaning outputs other than HTML-like ones will not receive a SmallCaps.

That's why I think the more natural way to solve the problem is to grant the SmallCaps element attributes, which is part of #684's discussion.

@mb21, once there is an element in the AST, can't it be deprecated? I mean, it is an internal (interchange) format and no human document is supposed to be written in that format, or is it?

@ickc, I guess elements are not the only things that can be translated into formats other than HTML. Language is an attribute and it is also translated (or it should be) into other formats.

If [Deutsch]{lang=de} is not problematic, what is the problem with [Shakespeare]{.author}?

You may object that TeX cannot handle classes, but pandoc should translate this particular class (.smallcaps) to \textsc{}.

Otherwise, we are multiplying elements _ad infinitum_.

But of course, if we cannot get rid of the SmallCaps element, it should be granted attributes.

In that case, my question would be: if TeX cannot handle attributes other than language, what is the use of granting attributes to the SmallCaps element?
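The idea of translating the .smallcaps class per output format, rather than relying on a dedicated element, can be sketched as follows. This is an illustrative Python toy, not how pandoc's writers are implemented: `render_span` is a hypothetical function, it knows only two formats, and it does not escape LaTeX special characters.

```python
import html


def render_span(text: str, classes: list, fmt: str) -> str:
    """Render a classed span for 'html' or 'latex' (illustrative only)."""
    if fmt == "html":
        escaped = html.escape(text)
        if not classes:
            return escaped
        # HTML keeps every class, so styling stays configurable via CSS.
        return f'<span class="{" ".join(classes)}">{escaped}</span>'
    if fmt == "latex":
        # Only the smallcaps class has a LaTeX counterpart in this sketch;
        # any other class is silently dropped.
        return f"\\textsc{{{text}}}" if "smallcaps" in classes else text
    raise ValueError(f"unsupported format: {fmt}")


print(render_span("Shakespeare", ["smallcaps"], "html"))
print(render_span("Shakespeare", ["smallcaps"], "latex"))
print(render_span("Shakespeare", ["author"], "latex"))
```

The last call shows the trade-off raised in the question: a class with no counterpart in the target format simply disappears from the output.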

But there are already a lot of things built around pandoc, e.g. the 6 filter frameworks. While deprecating SmallCaps or granting it attributes would both break them, I suppose adding attributes is the easier change to absorb in filters and filter frameworks. There are proposals in the other thread to hide the attributes when they are empty, which would give better backward compatibility even if nothing is done on the filter / filter framework side.

A lot of things depend on pandoc (which is a good thing), but that also makes backward-incompatible changes very hard, and when one is made, its impact has to be kept minimal.

SmallCaps was added before we had Span with attributes, and
we needed it for pandoc-citeproc. If we were doing it now,
I'd probably be inclined to use Span with attributes, but it's
better not to make breaking changes without a really strong
rationale.

I see the point. It would be nice to have SOME way of
indicating SmallCaps in pandoc's Markdown. The most
natural way would be

[my text]{.smallcaps}

But this already has the meaning of a Span with a class.

We could have the Markdown reader just parse this as a
SmallCaps element, but this would remove the ability
to have smallcaps as a class on a native span. Maybe
that's okay. If we did that we'd probably want to make
a related change, rendering in HTML with a class smallcaps
rather than a style attribute, and adding a default
definition for this class to the default header.

> this would remove the ability to have smallcaps as a class on a native span.

You could still use raw HTML syntax which gets parsed by the markdown reader into a native span, right?

+++ Mauro Bieg [Mar 02 17 01:21 ]:

> > this would remove the ability to have smallcaps as a class on a
> > native span.
>
> You could still use raw HTML syntax which gets parsed by the markdown
> reader into a native span, right?

Yes, I suppose so.

+++ Mauro Bieg [Mar 02 17 01:21 ]:

> > this would remove the ability to have smallcaps as a class on a
> > native span.
>
> You could still use raw HTML syntax which gets parsed by the markdown
> reader into a native span, right?

Though I think it's ugly if the native span syntax and the
HTML syntax give different results in this case...

Do you imply attributes will be granted for SmallCaps?

> I see the point. It would be nice to have SOME way of
> indicating SmallCaps in pandoc's Markdown. The most
> natural way would be
>
> [my text]{.smallcaps}
>
> But this already has the meaning of a Span with a class.
>
> We could have the Markdown reader just parse this as a
> SmallCaps element, but this would remove the ability
> to have smallcaps as a class on a native span. Maybe
> that's okay. If we did that we'd probably want to make
> a related change, rendering in HTML with a class smallcaps
> rather than a style attribute, and adding a default
> definition for this class to the default header.

OK, I've implemented most of this.

What is still needed is a default definition for the smallcaps class in the following templates:

  • [ ] default.html4
  • [ ] default.html5
  • [ ] default.slidy
  • [ ] default.slideous
  • [ ] default.s5
  • [ ] default.revealjs
  • [ ] default.dzslides
  • [ ] default.epub2
  • [ ] default.epub3