Pandoc: Pandoc keeps span and div tags when converting html to org

Created on 28 Jun 2017 · 20 Comments · Source: jgm/pandoc

Version: 1.19.2.1

pandoc -f html -t org and pandoc -f html -t org-raw_html-native_divs-native_spans produce identical output.

Input html:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title></title>
</head>
<body>
  <div class="Section1">
    <p class="Question"><span style="FONT-SIZE: 10pt">Today</span> <span style=
    "FONT-SIZE: 10pt">is</span> <span lang="HR" style=
    "FONT-SIZE: 10pt; mso-ansi-language: HR">a</span><span style=
    "FONT-SIZE: 10pt">nice</span> <span style="FONT-SIZE: 10pt">day</span> 
    </p>
  </div>
</body>
</html>

Both commands output this org:

#+BEGIN_HTML
  <div class="Section1">
#+END_HTML

Today is anice day

#+BEGIN_HTML
  </div>
#+END_HTML

Expected:

Today is anice day

With pandoc -f html -t markdown-raw_html-native_divs-native_spans there seems to be no problem.

Org-mode writer

tshu-w

All 20 comments

Can you say why you expected what you expected (rather than what you got)?

jgm on 28 Jun 2017

The question is just how native spans should be rendered in Org mode. I don't know enough about Org mode to know all the options - pinging @tarleb.

jgm on 28 Jun 2017

One method would be to use special blocks, like this:

#+ATTR_HTML: :class all classes but the first :key-value pairs
#+BEGIN_section1
here goes the content
#+END_section1

I've shied away from this solution for two reasons:

Nesting is not possible (i.e., problems will result if nested divs share the same first class).
Exporting from org will be ok when writing HTML, but LaTeX will give \begin{section1}…\end{section1}, which might be unexpected.

It might be the lesser evil though. Input of other org-mode users would be appreciated.

tarleb on 28 Jun 2017

There is no need for these DIVs to be represented in the Org output in the first place. Org organizes content by headings, which are H1, H2, etc. in HTML, and those are already converted correctly. DIVs like these are primarily used in HTML to organize and style content with CSS, which is irrelevant to Org.

These DIVs are simply useless clutter in the converted Org output. Instead of a user getting a clean, useful conversion of an HTML document to Org, he must spend extra time manually removing these irrelevant HTML DIV code blocks.

Previous Pandoc versions did not output this spurious HTML content; e.g. I'm using Pandoc 1.12.2.1 from Ubuntu 14.04, and it outputs only useful content (see example screenshot at https://github.com/alphapapa/org-protocol-capture-html).

I'm sure the ability to track and output these HTML DIVs in some output formats is very useful. But it's actually a detriment to Org output. And, as @tarleb pointed out, they cannot be nested properly in Org's native syntax. If there's a way to preserve them with extra options, that's great, but that's definitely a corner-case compared to a user simply wanting to capture the plain-text and outline-structured content from a simple HTML page. No one is expecting HTML->Pandoc->Org->Pandoc->HTML to produce a 1:1 conversion. So these HTML DIVs should simply be disabled for normal org output.

Thanks.

alphapapa on 29 Jun 2017

👍1

From my point of view, it would be nice to have ways to meet different needs.
With markdown, for example, I can avoid span and div tags by turning off the raw_html, native_divs and native_spans extensions.

tshu-w on 29 Jun 2017

I'm sympathetic to the idea that native Divs should not come across as raw HTML in Org.
But what about Divs with an id attribute that serve as anchors?
Is there a way to represent internal anchors in Org?

jgm on 29 Jun 2017

Divs with only an id and no class or key-value pairs are currently unwrapped and the id is inserted as an <<anchor>>. Furthermore, if one of the classes is either quote or center, we wrap the content in the respective block type. If one of the classes is drawer, we create a drawer.

My current preference is to add a rule checking if the content is a single block, and to prefix that block with #+ATTR_HTML if that's the case. Otherwise, we should just keep the id if present and just output the content. This should keep the important information without shoehorning divs into the output. Does this seem reasonable?

tarleb on 29 Jun 2017
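
The Div handling described in the comment above can be modelled roughly as follows. This is only a toy Python sketch of the stated rules, not pandoc's actual (Haskell) writer code: the function name div_to_org, its parameters, and the fixed drawer name are all hypothetical, and the fallback branch simply reproduces the raw-HTML wrapping shown in the original report.

```python
from typing import Dict, List, Optional


def div_to_org(content: str,
               ident: str = "",
               classes: Optional[List[str]] = None,
               attrs: Optional[Dict[str, str]] = None,
               drawer_name: str = "DRAWER") -> str:
    """Toy model of the Div rules described in the thread (hypothetical helper)."""
    classes = classes or []
    attrs = attrs or {}

    # Rule 1: only an id, no classes or key-value pairs:
    # unwrap the div and keep the id as an Org anchor.
    if ident and not classes and not attrs:
        return f"<<{ident}>>\n{content}"

    # Rule 2: a "quote" or "center" class maps to the matching Org block.
    for name in ("quote", "center"):
        if name in classes:
            block = name.upper()
            return f"#+BEGIN_{block}\n{content}\n#+END_{block}"

    # Rule 3: a "drawer" class becomes an Org drawer.
    # Assumption: the drawer name is supplied by the caller.
    if "drawer" in classes:
        return f":{drawer_name}:\n{content}\n:END:"

    # Fallback: wrap the content in raw HTML, as in the reported output.
    parts = []
    if ident:
        parts.append(f'id="{ident}"')
    if classes:
        joined = " ".join(classes)
        parts.append(f'class="{joined}"')
    for key, value in attrs.items():
        parts.append(f'{key}="{value}"')
    open_tag = ("<div " + " ".join(parts) + ">") if parts else "<div>"
    return (f"#+BEGIN_HTML\n  {open_tag}\n#+END_HTML\n\n"
            f"{content}\n\n"
            f"#+BEGIN_HTML\n  </div>\n#+END_HTML")
```

Calling div_to_org("Today is anice day", classes=["Section1"]) falls through to the last branch, which is the case this issue complains about.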

+++ Albert Krewinkel [Jun 29 17 03:32 ]:

Divs with only an id and no class or key-value pairs are currently
unwrapped and the id is inserted as an <<anchor>>. Furthermore, if one
of the classes is either quote or verse, we wrap the content in the
respective block type. If one of the classes is drawer, we create a
drawer.

My current preference is to add a rule checking if the content is a
single block, and to add #+ATTR_HTML if that's the case. Otherwise, we
should just keep the id if present and just output the content. Does
this seem reasonable?

What do you mean, "checking if the content is a single block"?

jgm on 29 Jun 2017

Divs with only an id and no class or key-value pairs are currently unwrapped and the id is inserted as an <<anchor>>.

That seems sensible. If it's feasible, it would be nice to be able to disable the anchors in the output, as well.

Furthermore, if one of the classes is either quote or center, we wrap the content in the respective block type.

Seems like a clever hack to avoid parsing CSS while still getting the idea. Nice.

If one of the classes is drawer, we create a drawer.

I'm not sure about this one. Drawers are peculiar to Org-mode, and I wouldn't expect HTML elements with a class name drawer to be the equivalent, unless the author of the HTML was an Org user so fond of drawers that he reimplemented them in his HTML/CSS. :)

My current preference is to add a rule checking if the content is a single block, and to prefix that block with #+ATTR_HTML if that's the case.

I guess by "single block" you mean a non-nested DIV, i.e. a DIV containing no other DIVs?

I guess that's a decent compromise, as it allows users to remove the raw HTML by removing a single line from the Org output, instead of having to unwrap text from an HTML block. It would still be nice to have some kind of text-and-outline-structure-only mode that would leave out any raw HTML. Unless the user is planning to reconvert the Org to HTML for republishing (unlikely, if he's not the author, and the author would have the HTML to begin with), he probably will have no use for HTML like that in the output.

Otherwise, we should just keep the id if present and just output the content. This should keep the important information without shoehorning divs into the output. Does this seem reasonable?

I guess you mean, if the DIVs are nested, only output the <<anchor>>, and ignore the DIV attributes?

That's fine with me, although it seems inconsistent. But nesting DIVs in Org syntax isn't feasible, and I don't think trying to hack around that would make sense. HTML-to-Org is practically a one-way conversion for capturing or archiving information in an informal way, so it doesn't need raw HTML in the first place.

Thanks.

alphapapa on 30 Jun 2017

By the way, I just found this question on Emacs.SE about this very problem: https://emacs.stackexchange.com/questions/24676/html-to-orgmode-via-pandoc-get-rid-of-all-begin-html-blocks It seems like disabling the output of DIVs in the Org output format would help a lot of people.

@zeltak FYI. :)

alphapapa on 2 Jul 2017
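
For output that has already been generated, one possible stopgap is to post-process the Org text and drop the raw HTML blocks. The following is a minimal Python sketch under that assumption; strip_html_blocks is a hypothetical helper, not part of pandoc or Org, and it only recognises the #+BEGIN_HTML / #+END_HTML form shown in the report above.

```python
import re


def strip_html_blocks(org_text: str) -> str:
    """Remove #+BEGIN_HTML ... #+END_HTML blocks from Org text."""
    kept = []
    in_block = False
    for line in org_text.splitlines():
        marker = line.strip().upper()
        if marker == "#+BEGIN_HTML":
            in_block = True      # start skipping raw HTML
            continue
        if marker == "#+END_HTML":
            in_block = False     # resume keeping lines
            continue
        if not in_block:
            kept.append(line)
    text = "\n".join(kept)
    # Collapse the runs of blank lines left behind by removed blocks.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip("\n") + "\n"
```

Note that this discards every raw HTML block, including any that might be wanted, so it is a blunt tool compared with fixing the conversion itself.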

thx @alphapapa :)

zeltak on 2 Jul 2017

Thanks @alphapapa, this is helpful.

The special handling of drawers was added to reduce information loss during org→org translations in pandoc: reading a drawer with pandoc's org reader gives a div containing drawer as one of its classes to make styling easier when writing as HTML via pandoc.

@jgm: by "checking if the content is a single block", I meant inspecting whether the div contains just a single block in the pandoc sense. The downside of this approach is that it would fail with some block types (e.g., lists; org is weird that way) and that it becomes more difficult for users to understand why attributes are retained for some divs but not for others.

I guess I agree with @alphapapa that it's better to keep special cases to a minimum here and that dropping everything but the div's id is the best option.

@alphapapa: We are currently integrating a Lua-based filtering system into pandoc; removing unwanted information should become as easy as writing three short lines of Lua code.

tarleb on 3 Jul 2017
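
The thread does not show the Lua code itself. As a rough analogue, the same idea can be sketched against pandoc's JSON representation of a document, where elements are objects with a "t" (type) field and usually a "c" (contents) field. The Python below is an illustrative sketch only: unwrap and filter_json are hypothetical names, and the assumption is that Div and Span contents have the shape [attributes, children].

```python
import json

# Element types to unwrap; their "c" field is assumed to be [attr, children].
UNWRAP = ("Div", "Span")


def unwrap(node):
    """Recursively replace every Div/Span with its children."""
    if isinstance(node, list):
        out = []
        for item in node:
            item = unwrap(item)  # handle nested wrappers first
            if isinstance(item, dict) and item.get("t") in UNWRAP:
                out.extend(item["c"][1])  # splice children into the parent list
            else:
                out.append(item)
        return out
    if isinstance(node, dict):
        return {key: unwrap(value) for key, value in node.items()}
    return node


def filter_json(text: str) -> str:
    """Apply unwrap to a JSON-encoded document and return JSON text."""
    return json.dumps(unwrap(json.loads(text)))
```

To use something like this as a JSON filter, a small script would read the document from standard input, pass it through filter_json, and write the result to standard output. Unlike the option discussed above, this sketch also drops ids, so no anchors would survive.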

@tarleb Ah, that's clever about the drawers. And thanks, the Lua filtering sounds great. I assume that the filters can be passed as command-line options? That would be ideal for my use-case.

alphapapa on 3 Jul 2017

@tarleb Thank you very much. Which version of Pandoc will that end up in?

alphapapa on 1 Sep 2017

This will be shipped with pandoc v2.0. You can download a nightly from the unofficial nightly builds repo if you want to test this without building pandoc from source.

tarleb on 1 Sep 2017

👍2

Jgm pointed out on the mailing list that disabling the native_divs extension in the reader is a good workaround for this:

pandoc -f html-native_divs -t org …

tarleb on 1 Sep 2017

👍4

Is there an easy way to find out which version that was added to? The version I've got on my Ubuntu Trusty system doesn't have that extension.

alphapapa on 1 Sep 2017

I checked the git log: it seems that the native_divs extension was added with pandoc 1.13.

tarleb on 2 Sep 2017

👍1

@alphapapa The changelog is quite comprehensive.

vyp on 2 Sep 2017

👍1

You could look at the cumulative changelog (available on pandoc.org
under Releases).
