Hi!
First of all, thank you for your great tool!
Being flexible and modular, it can sometimes be used in ways beyond its dedicated purpose. For example, AST filters can be used to serialize the AST somewhere, which allows using Pandoc simply as a markup-language parser where needed.
With that in mind, would it be difficult to add source-file coordinates to the AST, or perhaps to a separate sourcemap-like file? For some formats, like MS Word, these coordinates (positions within the XML) would not be very informative, but for human-readable formats it could be a really great feature.
I'm sure there is a pandoc issue about this somewhere, but all I could find was https://github.com/atom-community/markdown-preview-plus/issues/106 and https://github.com/commonmark/CommonMark/issues/57
Discussion in Google group https://groups.google.com/forum/#!msg/pandoc-discuss/tVe1RapDN5U/YGJE76FB74QJ
Does not look very optimistic though... =)
Adding source locations for everything would require a lot of
fundamental changes in the readers, and in the AST, so it's not at all a
trivial change.
Adding source locations for block-level elements would be a bit
easier; they could all be wrapped in Divs with source position
attributes. Not sure if this would be useful, though.
The pure Haskell commonmark parser I'm working on (jgm/commonmark-hs),
which may some day be the basis of pandoc's Markdown parser, tracks
detailed source positions for everything.
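For concreteness, here is a sketch (Python, against pandoc's JSON serialization) of what the Div-wrapping idea might look like from a filter's point of view. The attribute name "data-pos" and the position string format are made up for illustration here, not an established pandoc convention:

```python
import json

def wrap_with_pos(block, start, end):
    """Wrap a pandoc-JSON block in a Div carrying a source-position
    attribute. Attr shape is [id, [classes], [[key, value], ...]];
    the "data-pos" key and "line:col-line:col" format are assumptions."""
    attr = ["", [], [["data-pos", f"{start[0]}:{start[1]}-{end[0]}:{end[1]}"]]]
    return {"t": "Div", "c": [attr, [block]]}

para = {"t": "Para", "c": [{"t": "Str", "c": "foo"}]}
wrapped = wrap_with_pos(para, (1, 1), (1, 4))
print(json.dumps(wrapped))
```

A filter that wants positions would then look for Divs carrying that key; filters that don't care can ignore or unwrap them.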
> Adding source locations for block-level elements would be a bit easier; they could all be wrapped in Divs with source position attributes. Not sure if this would be useful, though.
It would be useful. Example: editor/preview position synchronization for two-pane editors à la https://dillinger.io/ -- any source-position information would help with that.
In the context of adding attributes to all elements, I think mpickering somewhere suggested using polymorphic types like CodeBlock a String. I guess you'd then be freer to store the source map as any data type you want inside that a...
Hmmmm, not sure how much use the polymorphic types would see; my guess is they'd only be used for a handful of things, in which case it's easier to just have some more concrete types. But I would be for having attributes attached to more of the AST. I'd also like to see this done as an incremental change, so perhaps something like: (1) add attributes for all Block elements as an initial step; (2) add an extension like #4659 which only adds source attributes in a few places when the filter is turned on; (3) as people request source attributes for more places in the AST and for more readers, go through and make it happen.
I think, though, it would be nice to turn on source attributes "globally" in some sense, without having to worry about "is it Markdown source? which blocks do we do it for?". To that end, I can think of a few options:
Add a new Block type called Comment. Probably somewhat orthogonal, but then people can put whatever directives they want in comments, and it helps with literate programming (so that we can keep the source text as well, which is my original use case).
Add a new Block type BlockLoc SourcePos Block, which just wraps a Block with the given source-location information. Once again, incrementally update the parsers to insert these when parsing blocks; this may be possible to do once per reader language at the Block-level parser (instead of once per reader language per block constructor).
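To make the second option concrete, in the JSON serialization a BlockLoc wrapper might look like the following sketch (Python; the constructor name comes from the proposal above, but the field layout is invented for illustration):

```python
def block_loc(line, col, block):
    # Hypothetical JSON shape for a BlockLoc SourcePos Block wrapper;
    # the {"line", "col"} field names are invented for illustration.
    return {"t": "BlockLoc", "c": [{"line": line, "col": col}, block]}

def unwrap(block):
    # Filters that don't care about positions can strip the wrapper.
    return block["c"][1] if block.get("t") == "BlockLoc" else block

para = {"t": "Para", "c": [{"t": "Str", "c": "bar"}]}
wrapped = block_loc(3, 3, para)
print(unwrap(wrapped) == para)  # True
```

One design cost this makes visible: every existing filter would have to learn to look through the wrapper, which is part of why changing the AST is painful.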
I don't know though; I'm not sure we should halt all progress on this while searching for the perfect solution. It might be better to work towards it incrementally with something that fits the use cases of those who speak up.
AST changes tend to be extremely painful for everyone, so generally :+1: for polymorphic types. "Incrementally" changing the AST doesn't sound like such a good idea.
Well, the incremental part wouldn't be changing the AST, just incrementally changing the parsers to actually insert the source position attributes. The two changes I proposed to the AST could be taken or left; either way you'll have to change the parsers at some point. The ideal situation would be to have a single place where we can flip the "turn on source position attributes" switch, but I don't think the current architecture of the Readers has a common entry point for all readers.
My point is that BlockLoc is obviously not the be-all and end-all solution (e.g., what do we do if we want source-map attributes on inline elements?). So it would have to be changed at some point, which is a pain. Comment is not related to the discussion at all. If it were easy to paste source positions into the source Markdown, I'd have done it ages ago via span/div -- the sad truth is that one has to completely parse the source to do that.
Yeah, I guess those were just a couple random ideas, I certainly don't think they are the solution to go with immediately. Sorry that I distracted the conversation with them.
Basically, what I'm trying to get at is that perhaps we should go with a "good-enough" solution for the various use-cases people have asked about, and worry about a general solution as we see more use cases. To that end, if there are more changes that should be added to #4659 directly to satisfy more use-cases without performing a larger surgery on the reader framework/the ADTs, perhaps we should get them in now. It could very well be that people are happy with the "good-enough" for a while, giving more time for this discussion.
Just thought I would chime in with another use case. I'm interested in running grammar-checking tools on documents in different markup formats. Feeding in the markup directly generates a lot of spurious errors. It would really help to convert to plain text with pandoc and then use a source mapping to know where in the original text each issue occurred.
@michaelmior which constructs are you specifically interested in having source mapping for? Or just all constructs in the document?
I'd like to be able to take a line and column number from plain text output and map it back to a line and column number in the input. Since it's pretty well-adopted, it would be ideal if pandoc could have the option to print out a source map in the same format used by web browsers to map from compiled/minified CSS and JS to the original source.
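The browser source-map format (version 3) stores its mappings as base64 VLQ-encoded deltas. As a rough illustration of what emitting that format would involve, here is a minimal VLQ encoder sketch; a full source-map generator would additionally need to track the per-segment fields and line separators:

```python
# Base64 VLQ as used by version-3 source maps: sign bit in the least
# significant bit, 5 data bits per digit, 0x20 as the continuation bit.
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def vlq_encode(value):
    n = (value << 1) if value >= 0 else ((-value << 1) | 1)
    out = []
    while True:
        digit = n & 31
        n >>= 5
        if n:
            digit |= 32  # more digits follow
        out.append(B64[digit])
        if not n:
            return "".join(out)

print(vlq_encode(16))  # "gB"
print(vlq_encode(0))   # "A"
```

Whether pandoc should emit this exact format or a simpler JSON mapping is a separate question; the encoding itself is the only exotic part.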
This is a duplicate of https://github.com/jgm/pandoc/issues/3809 but I guess this issue here has more info :)
The following is copied from #5461, which was closed as a duplicate of this issue. I am including it here for the sake of discussion since it has some scenarios and an example implementation that should be considered in parallel with other suggestions here as this feature is actually developed.
I'm tired of using Regular Expressions to try to parse Markdown.
I run a publishing house where the canonical version of all our source material is stored in Markdown, and in the process of editing, translating, and publishing our materials there are a ton of operations that need to be done on the source files that require some level of contextual understanding. As our internal tooling is starting to look more and more like an IDE, more and more of our scripts are having to make guesses about lines in our source files. The most frustrating question to answer reliably is probably this one:
Is line N inside a blockquote? a list item? a fenced code block? a div? a header? some nested combination of blocks?
This question is surprisingly difficult to answer using general-purpose stream editors and text manipulation tools. It isn't very hard to whip up a RegEx that takes a wild guess, but it also isn't very hard to start running into edge cases where that guess is wildly off.
Inline context is much easier to parse with RegEx, but still error-prone.
Is word Y inside emphasis markup? an inline span with a language attribute?
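With AST output, questions like these turn into a tree walk instead of a RegEx guess. A sketch in Python over pandoc's JSON serialization, using a deliberately tiny hand-written document:

```python
def find_context(node, target, stack=()):
    """Walk pandoc-JSON and yield the enclosing constructor stack for
    every Str equal to `target`. A sketch against the JSON shape used
    in this thread; real documents have many more node types."""
    if isinstance(node, dict):
        t = node.get("t")
        if t == "Str" and node.get("c") == target:
            yield stack
        yield from find_context(node.get("c"), target, stack + (t,))
    elif isinstance(node, list):
        for child in node:
            yield from find_context(child, target, stack)

doc = [{"t": "Para", "c": [{"t": "Emph", "c": [{"t": "Str", "c": "word"}]}]}]
hits = list(find_context(doc, "word"))
print(hits)  # [('Para', 'Emph')]
```

This answers "is word Y inside emphasis?" but, without source positions in the AST, it still can't say *where* in the input file that emphasis sits, which is the gap this issue is about.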
Implement an optional --source-location flag (or similar) that keeps track of byte offsets while reading input and adds them to the AST. This would probably include line, column, and overall input byte offset.
Consider input file test.md:
foo

> bar
I would want to return something like this:
$ pandoc --source-location -t json test.md | jq
{
  "blocks": [
    {
      "t": "Para",
      "l": {
        "l": 1,
        "c": 1,
        "b": 1
      },
      "c": [
        {
          "t": "Str",
          "l": {
            "l": 1,
            "c": 1,
            "b": 1
          },
          "c": "foo"
        }
      ]
    },
    {
      "t": "BlockQuote",
      "l": {
        "l": 3,
        "c": 1,
        "b": 6
      },
      "c": [
        {
          "t": "Para",
          "l": {
            "l": 3,
            "c": 3,
            "b": 8
          },
          "c": [
            {
              "t": "Str",
              "l": {
                "l": 3,
                "c": 3,
                "b": 8
              },
              "c": "bar"
            }
          ]
        }
      ]
    }
  ],
  "pandoc-api-version": [
    1,
    17,
    5,
    4
  ],
  "meta": {}
}
My use case is for the Markdown reader, but I suspect any plain-text input format would benefit from this.
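Given such "l" fields, the "is line N inside a blockquote?" question from earlier becomes mechanical. A sketch follows; note it assumes each node also records an end line (called "le" here, which the example output above does not actually include), since start positions alone can't delimit a span:

```python
def enclosing(blocks, line, stack=()):
    # Return the constructor stack of blocks whose recorded span contains
    # `line`. The "le" end-position field is hypothetical; only BlockQuote
    # children are descended into, to keep the sketch small.
    for b in blocks:
        if b["l"]["l"] <= line <= b["le"]["l"]:
            kids = b["c"] if b["t"] == "BlockQuote" else []
            return enclosing(kids, line, stack + (b["t"],))
    return stack

doc = [
    {"t": "Para", "l": {"l": 1}, "le": {"l": 1}, "c": []},
    {"t": "BlockQuote", "l": {"l": 3}, "le": {"l": 3}, "c": [
        {"t": "Para", "l": {"l": 3}, "le": {"l": 3}, "c": []},
    ]},
]
print(enclosing(doc, 3))  # ('BlockQuote', 'Para')
```

A real implementation would descend into every container block type (lists, Divs, tables), but the shape of the query is the same.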
A couple problems for implementing this currently:
You request byte offset. Pandoc has access to the source line and column, but not the byte offset (which of course depends on the encoding). The byte offset could, of course, be computed from the source line and column. Character offset might be more useful in general, though.
Pandoc's current markdown parsing strategy doesn't always allow us to give accurate source positions. For example, when we parse a block quote, we strip off the initial "> " or ">" from each line, then reparse the result. Since we may have stripped off different numbers of characters from each line (including 0 characters with "lazy" continuations), we've lost the information we need for accurate source positions. This can be fixed when we integrate commonmark-hs later, since it gives 100% accurate source positions.
Thanks for the feedback @jgm.
A character offset would suffice. Given the encoding, the byte offset could be computed anyway; and realistically I'd make use of the character offset far more often and only suggested the byte offset so that could be computed.
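For what it's worth, that computation is a one-liner once the character offset and the source text are both in hand; a sketch:

```python
def byte_offset(text, char_offset, encoding="utf-8"):
    # Byte position of a character offset: encode the prefix and count.
    return len(text[:char_offset].encode(encoding))

src = "héllo\n> bar"
print(byte_offset(src, 5))  # 6: "é" is two bytes in UTF-8
```

So a character offset in the AST really does subsume the byte offset, as long as the consumer also has the original source.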
But it is interesting that you say the source line and column are accessible already. Even given the inaccuracies in the column value due to the strip/reparse cycle, the line offset alone would help identify what kind of block nesting a given inline is wrapped in, no? Is there a way to output this? (I'm using patched versions of Pandoc anyway, so I don't mind bolting something on pending progress on ②.)
I'm following commonmark-hs with considerable interest. I wish my Haskell chops weren't such weak sauce and I could help.
@jgm, @alerque For the needs of vim-pandoc-syntax even source line number would be a vast improvement.
> Pandoc's current markdown parsing strategy doesn't always allow us to give accurate source positions. For example, when we parse a block quote, we strip off the initial "> " or ">" from each line, then reparse the result. Since we may have stripped off different numbers of characters from each line (including 0 characters with "lazy" continuations), we've lost the information we need for accurate source positions.
@jgm: Perhaps the situation is more complicated than (or different from) what I understand from your comment. Based on what I do understand, I wonder whether you might just save the column index of the truncation for each line in the block before the recursive call to the parser. Then you could correct the resulting tree by shifting each node's column by the saved offset for its particular line. More generally, in other cases, could you not similarly save the information needed to reverse the effect of the transformation on the node positions?
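A toy sketch of that bookkeeping for the blockquote case (Python; real parsing is more involved, e.g. markers can be indented up to three spaces, but the principle is the same):

```python
# Record how many characters were stripped from each 1-based line before
# reparsing, then shift the reparsed columns back by the saved amount.
def strip_quote_markers(lines):
    stripped, shifts = [], {}
    for i, line in enumerate(lines, start=1):
        if line.startswith("> "):
            stripped.append(line[2:]); shifts[i] = 2
        elif line.startswith(">"):
            stripped.append(line[1:]); shifts[i] = 1
        else:  # "lazy" continuation: nothing stripped
            stripped.append(line); shifts[i] = 0
    return stripped, shifts

def correct(pos, shifts):
    line, col = pos
    return (line, col + shifts.get(line, 0))

body, shifts = strip_quote_markers(["> bar", "lazy"])
print(correct((1, 1), shifts))  # (1, 3): column 1 of "bar" was column 3
```

The wrinkle, as noted above, is that the reparse would also need to report positions relative to the stripped text in the first place, which the current reader architecture doesn't guarantee.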
Yes, it's not impossible.
However, I don't feel like overhauling the markdown reader to add this.
The commonmark parser in commonmark-hs, which I'll be integrating into pandoc, already has complete source position information.