Pandoc: Add Option for Lossyness Reports

Created on 29 Jan 2017 · 14 Comments · Source: jgm/pandoc

Because pandoc’s intermediate representation of a document is less expressive than many of the formats it converts between, one should not expect perfect conversions between every format and every other. [...] While conversions from pandoc’s Markdown to all formats aspire to be perfect, conversions from formats more expressive than pandoc’s Markdown can be expected to be lossy.

Currently there is no way to know whether a given pandoc conversion is lossy. It would be nice to have an option that performs a dry-run conversion and displays a report on element loss, with one of two outcomes:

- The conversion from one format to another did not lose any elements, and a standard message of losslessness is displayed (on STDERR); or
- Conversion to a less expressive format resulted in one or more elements being left out, flattened, assimilated into similar elements, or otherwise degraded. A standard message of lossiness is displayed (on STDERR), and (optionally) a summary of the lost elements and their context (on STDOUT).

This option could be helpful for testing formats before proceeding with the actual conversion. Sometimes we simply get confused by the multiple formats and forget that a given element won't render in another format. A lossiness warning would be a better solution than manually checking whether every element is present in the final output.

This would also be useful in big projects, especially script-automated ones such as API documentation. It would allow users to check (and control) whether elements are lost along the pipeline and to take countermeasures if they are. For example, a pre-conversion check might block a specific release if pandoc raises a lossiness warning, allowing maintainers to edit the source docs so that they use only elements that survive the conversion.

tajmone

👍1

All 14 comments

The new architecture we have in the typeclass branch (future pandoc 2.0) makes it much easier for all readers and writers to issue warnings and info messages. So this opens the way to add informative messages to all readers when there is lossiness. (Of course, adding these would be a nontrivial amount of work.) These info-level messages would be enabled by the --verbose option.

jgm on 29 Jan 2017

Of course, comments are welcome on the proposed system. What I have now is a logging mechanism with ERROR, WARNING, INFO, and DEBUG levels. The user will be able to select the level of verbosity. I also have a flag to treat warnings as errors; perhaps it would be worthwhile to have another option to treat info messages as errors? Or perhaps lossiness indications should be warnings? (There will likely be many of them.)

jgm on 29 Jan 2017

The ideal is a system that pleases both humans and scripts: readable output for the former, parsable output for the latter.

Exit Codes

At its most basic, an exit code of 0 for lossless and >=1 for lossy should satisfy both humans and scripts. Lossiness could be represented by setting flags in the exit code. I don’t have a clear picture of all the possible losses an element can undergo in the various output formats, but I assume these could all be loss cases (with some approximate descriptors):

- deletion: the whole element is lost during format translation (e.g. footnotes, a table?).
- flattening/normalization: the element’s style is discarded but the text is retained (e.g. struck-through text becomes plain).
- conversion/assimilation: an element’s style is rendered with an approximately similar style (e.g. inline code as bold).

This is the kind of information the user might look for at the highest level. E.g., if pandoc reports an exit code >=1 for the conversion, we inspect the flags that make up the returned code to see which of the above types of loss occurred. Maybe in a given context any losses that don’t imply deletion of elements are OK and the conversion should go ahead.

So really, the difference between a warning and an error may be subjective, depending on usage context and expectations. But generally, I’d say that deletion of content is more critical than style changes or removals.
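The flag scheme proposed in this comment could be sketched as follows. This is only an illustration of the idea, not pandoc behavior: the `Loss` categories, their bit values, and the helper names `exit_code` and `is_acceptable` are all assumptions made for the example.

```python
from enum import IntFlag


class Loss(IntFlag):
    """Hypothetical loss categories, one bit each (not a pandoc API)."""
    NONE = 0
    DELETION = 1      # whole element dropped, e.g. a table
    FLATTENING = 2    # style discarded, text retained
    CONVERSION = 4    # style rendered as an approximately similar one


def exit_code(losses):
    """Combine the observed loss categories into a single exit status."""
    code = 0
    for loss in losses:
        code |= int(loss)
    return code


def is_acceptable(code, tolerated=Loss.FLATTENING | Loss.CONVERSION):
    """True when every flag set in `code` belongs to the tolerated set."""
    return (code & ~int(tolerated)) == 0
```

A calling script could then let a conversion go ahead when only style-level losses were flagged and stop when the deletion bit is set.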

Custom Reader/Writers

From the upcoming 2.0 changes you’ve mentioned, I assume that custom readers and writers will also be able to use this system. I’ve worked on a Markdown-to-BBCode custom writer and implemented a manual warning system along these lines: tables are lost completely, inline code is converted to bold, headers become bold text with different sizes, and so on. So, if this system is to be extended to custom readers and writers, it would need to cover all possible descriptors for lossiness cases.

Reports in JSON + Human Readable Format

As for the verbose report on the details of losses, JSON would be a good format for a scripted automation pipeline, and the same JSON structure could be printed as a human-readable, Markdown-formatted report on request.

The JSON report could group losses by loss type and, for each loss, provide a reference to the line in the original source, the original element, and maybe a string with the start of the affected text (this last item is intended only for the human-readable version).

E.g., requesting a human-readable report:

LOSSES REPORT:

- deletions (2)
- normalizations (4)
- conversions (11)

# DELETIONS

1.  ELEMENT DELETED: `table`
    LINE(S): 48-67.
    TEXT: "Table of Elements"

… just a speculative example, but it might show the convenience of having a standard that handles both a JSON representation and a human-readable Markdown report (which should also be easy to read in a terminal, as raw text).
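To make the JSON-plus-readable-report idea concrete, here is a minimal sketch of how one structure could feed both outputs. The JSON layout (loss types as keys, each entry carrying `element`, `lines`, and `text`) is an assumption modelled on the speculative report in this comment, and `render_report` is a hypothetical helper, not something pandoc provides.

```python
import json


def render_report(report_json):
    """Render a hypothetical JSON losses report as plain-text Markdown."""
    report = json.loads(report_json)
    lines = ["LOSSES REPORT:", ""]
    # Summary: one bullet per loss type with its count.
    for loss_type, items in report.items():
        lines.append(f"- {loss_type} ({len(items)})")
    # Details: one numbered entry per lost element, grouped by loss type.
    for loss_type, items in report.items():
        if not items:
            continue
        lines += ["", f"# {loss_type.upper()}", ""]
        for number, item in enumerate(items, start=1):
            first, last = item["lines"]
            lines.append(f"{number}.  ELEMENT: `{item['element']}`")
            lines.append(f"    LINE(S): {first}-{last}.")
            lines.append(f"    TEXT: \"{item['text']}\"")
    return "\n".join(lines)


# Made-up sample input, mirroring the table deletion in the example report.
sample = json.dumps({
    "deletions": [
        {"element": "table", "lines": [48, 67], "text": "Table of Elements"}
    ],
    "normalizations": [],
})
print(render_report(sample))
```

A script would consume the JSON directly, while a person would ask for the rendered text; both come from the same data.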

tajmone on 29 Jan 2017

+++ Tristano Ajmone [Jan 29 17 02:52 ]:

> At its very basic, an exit error level 0=lossless, >=1=lossy should
> satisfy both humans and scripts.

No, because it's standard in Unix for 0 to mean "exited
without errors"; warnings shouldn't cause non-0 exit codes
unless a special flag is used, as I suggested.

> to check the presence of the above type of losses. Maybe in a given
> context any losses that don’t imply deletion of elements are ok and the
> conversion should go ahead.

The way compilers usually handle this fine-grained
discrimination is by allowing each type of warning to be
selectively enabled or disabled by a command-line flag.

> From the upcoming 2.0 changes you’ve mentioned, I then assume that also
> custom readers and writers will be able to employ this system. I’ve

I haven't really thought about how to do this, but yes, I
think it should be possible to expose these functions to Lua.

jgm on 29 Jan 2017

I've added the framework for this (much better warnings about omitted content + machine-readable warnings + an option to generate an error status code if there are warnings).

I've also added more warnings to readers and writers, so one now gets much fuller information (especially with --verbose). However, we're still pretty far from giving complete information about what is omitted/changed.
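A build script could act on the machine-readable warnings roughly as sketched below (in released pandoc these features are, to my knowledge, exposed as the `--log=FILE` and `--fail-if-warnings` options). The sketch assumes the log is a JSON array of objects that each carry a `verbosity` key holding one of the level names mentioned earlier in this thread; that key name and the helper functions are assumptions, so check them against real log output before relying on this.

```python
import json
from collections import Counter


def count_by_verbosity(log_text):
    """Count log entries per verbosity level in a JSON log.

    Assumes the log is a JSON array of objects that each carry a
    "verbosity" key; entries without one are counted as "UNKNOWN".
    """
    entries = json.loads(log_text)
    return Counter(entry.get("verbosity", "UNKNOWN") for entry in entries)


def should_block_release(log_text, blocking=("ERROR", "WARNING")):
    """True if any entry has a verbosity level listed in `blocking`."""
    counts = count_by_verbosity(log_text)
    return any(counts[level] > 0 for level in blocking)
```

Passing `blocking=("ERROR", "WARNING", "INFO")` would make the check stricter, which corresponds to the "treat info messages as errors" idea raised above.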

Eventually we should add warnings to all writers for raw blocks/inlines that are not rendered (because the formats don't match). Currently we've got this for the following writers:
docbook, docx, fb2, haddock, html, icml, latex, man, markdown, opendocument, rtf, texinfo.

To add to the other writers, we need to do a bit of replumbing so that the writers are in PandocMonad.

jgm on 17 Feb 2017

TODO:

Convert these writers to use PandocMonad:

[x] asciidoc
[x] commonmark
[x] context
[x] dokuwiki
[x] epub
[x] mediawiki
[x] odt
[x] opml
[x] org
[x] rst
[x] tei
[x] textile
[x] zimwiki

Also:

[x] Add warning report when skylighting returns an error.
[ ] Add a warning category for cases where we're not ignoring content, but interpreting it differently, e.g. underline as emphasis.
[ ] DocBook reader - warnings for unsupported elements

jgm on 25 Feb 2017

Hi! I hope this is related enough.

I'm currently working on an open source book that has a build script in its directory, so users don't have to type in the pandoc commands directly. The source is structured like this: every chapter has its own Markdown file, and the build script uses pandoc to convert them into one big Markdown file. However, the footnotes in every chapter start at 1 rather than continuing from the previous chapter's last number. I don't want to change this, because it's easier to work with that way. So when someone executes the build script, pandoc throws a bunch of warnings about duplicate footnotes. The actual output is fine, because pandoc is smart enough to fix these footnotes.

But the warnings are still there. They could confuse users, and I don't see an option to disable warnings; in my opinion this is an important feature. At least for me. :D

So yeah, it would be cool if you could add this feature to your todo list. :)

nnmrts on 17 Jun 2017

My guess is that the output is not fine; you'll be
getting footnotes, but not the right footnotes, since
pandoc will use only one of the footnotes labeled "1".
Have you checked carefully?

You could try using --file-scope (see the manual).

jgm on 17 Jun 2017

👍1

I have checked carefully, the output is totally fine. And all three chapters have identical footnotes. See, it works perfectly, just the warnings could confuse people.

https://github.com/nnmrts/dafern/tree/master/src - these are the source markdown files
https://github.com/nnmrts/dafern/tree/master/build - these are the built files (html, md and pdf)
https://github.com/nnmrts/dafern/blob/master/build.ps1 - this is the build script

The script is spaghetti code, I know. :D

But the command is basically:

pandoc metadata.md chapter1.md chapter2.md chapter3.md -o book.md

The only relevant settings are --atx-headers --wrap=none --preserve-tabs, but I don't think they make a difference.

And this already works. The footnotes are correct and then I just convert the book.md to html and pdf and I'm done.

nnmrts on 30 Jun 2017

@nnmrts As I see it, the footnotes are not fine. I checked the PDF version and clicked on the 1st footnote in the 1st chapter, at

Ich würde mich trotzdem noch darüber beschweren1 ("I would still complain about it")

and was sent to the footnote of chapter 3:

1Das war halt auch einfach nicht so geil. ("That just wasn't all that great either.")

Maybe you should check this again.

(By the way: an exciting undertaking, your book.)

Wolf-at-SO on 30 Jun 2017

👍1

If you just want to turn off warnings, you can use
--quiet.

jgm on 30 Jun 2017

👍1

@Wolf-at-SO So apparently only the PDF is broken. Okay, thanks, I really hadn't noticed that, because I had hardly ever clicked on the footnotes. ~All the more interesting that the Markdown file works.~ The HTML is broken too, I see now, although I could swear I had already gotten exactly this to work with the same build process. That's weird. Oh well. (Thanks!)

@jgm Thank you very much, this will probably help me in the future. But as it seems, I need the warnings now even more than before, until I get my build script to work. :D

So yeah, sorry, I should have checked the other files more carefully. Thanks for the help anyway! :)

EDIT: So, locally my files are fine, but on GitHub none of them are, not even the Markdown file. The Markdown file I have locally works, but I haven't changed it since my last commit, so...
I have some bigger issues here...

nnmrts on 30 Jun 2017

So I fixed it now, using a version-like notation, like [^1.1] in chapter one or [^3.4] in chapter three. The output is as expected, with incremental rather than per-chapter footnote numbers. Awesome, I didn't know this could be so easy.
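For anyone who would rather not rename labels by hand, the same chapter-prefix scheme could be applied automatically before the files are concatenated. This is a minimal sketch under stated assumptions: `namespace_footnotes` is a hypothetical helper, and the regular expression rewrites every `[^label]` it finds, including any that appear inside code blocks, so it is not a full Markdown parser.

```python
import re

# Matches a Markdown footnote label such as [^1] or [^note], in both
# references and definitions ("[^1]: text"). Inline notes written as
# ^[text] do not match, because the caret comes before the bracket.
FOOTNOTE_LABEL = re.compile(r"\[\^([^\]\s]+)\]")


def namespace_footnotes(markdown, chapter):
    """Prefix every footnote label with the chapter number: [^1] -> [^3.1]."""
    return FOOTNOTE_LABEL.sub(
        lambda match: f"[^{chapter}.{match.group(1)}]", markdown
    )
```

A build script would call this once per chapter file, passing the chapter number, and then hand the rewritten text to pandoc.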

Thank you two again! 💓

nnmrts on 30 Jun 2017

Well, there are still lots more things we could warn about.
But I'm going to close this now, since we have a framework in place which can be incrementally improved.

jgm on 9 Aug 2017

❤1
