Marked: Reflink-like extension

Created on 3 Nov 2019 · 14 Comments · Source: markedjs/marked

I'd like to add a footnote reference tag ([^id]).

I assume it'd work like the link reference (reflink, in the source code), but I don't really know how to extend the parser to add this custom tag.

How should I proceed to integrate a tag that is defined in two places, one building a reference body list and one replacing link tags with some HTML to point to the right reference?

NFE - new feature (should be an extension)

Arteneko

All 14 comments

Can you provide some markdown and resulting html to illustrate what you are looking for?

UziTech on 3 Nov 2019

This is an example paragraph with a reference[^ref].

[...] below

[^ref]: This is the cite reference that will be listed at bottom of the article.

It would either expose a list of {id, text} objects, so the user can place them at the end of the page, or generate the following example HTML.

<p>This is an example paragraph with a reference<sup id="backref:ref"><a href="#ref:ref">2</a></sup>.</p>

<!-- [...] -->

<hr />

<ul>
  <li id="ref:ref">This is the cite reference that will be listed at bottom of the article.
 <a href="#backref:ref">&larrhk;</a></li>
</ul>

A stray [^id]: text definition is simply ignored, as it isn't referenced anywhere.
A stray [^id] reference is simply ignored, as it isn't defined anywhere.

Edit: The goal is not to integrate such a feature into marked, but rather to ask what the best way would be to integrate such a feature (including the complexity of references) into the parsing pipeline.

Arteneko on 4 Nov 2019

There are three ways you can change the output of marked:

Convert the markdown to html before sending it to marked:

const marked = require('marked');
const markdown = '...';

const markdownWithRefs = convertRefsToHTML(markdown);

const html = marked(markdownWithRefs);

Change tokens before sending to the parser:

const marked = require('marked');
const markdown = '...';

const tokens = marked.lexer(markdown);

const tokensWithRefs = addRefTokens(tokens);

const html = marked.parser(tokensWithRefs);

Change html after sending the markdown to marked:

const marked = require('marked');
const markdown = '...';

const html = marked(markdown);

const htmlWithRefs = convertRefsToHTML(html);

It would probably be better to combine these approaches: preprocess the markdown (for example, parse and remove the footnotes) before sending it to marked, convert the references to links in the tokens, and add the footnotes back after marked has rendered the rest.
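The preprocessing step could be sketched roughly as follows. This is only an illustration, not code from the thread: `extractFootnotes` is a hypothetical helper, it assumes single-line `[^id]: text` definitions, and it does not call marked at all.

```javascript
// Hypothetical helper (not part of marked): split footnote definitions
// out of raw markdown before handing the remaining text to marked.
function extractFootnotes(markdown) {
  // Assumption: a definition is a single line of the form "[^id]: some text".
  const definitionRe = /^\[\^([^\]]+)\]:[ \t]*(.*)$/;
  const footnotes = [];
  const bodyLines = [];

  for (const line of markdown.split('\n')) {
    const match = line.match(definitionRe);
    if (match) {
      // Keep the definition aside so it can be rendered at the end of the page.
      footnotes.push({ id: match[1], text: match[2] });
    } else {
      bodyLines.push(line);
    }
  }

  return { body: bodyLines.join('\n'), footnotes };
}

const { body, footnotes } = extractFootnotes(
  'A paragraph with a reference[^ref].\n\n[^ref]: The cited text.'
);
console.log(footnotes); // [ { id: 'ref', text: 'The cited text.' } ]
console.log(body);
```

The returned `body` would then go through marked as usual, and `footnotes` would be rendered and appended afterwards.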

UziTech on 4 Nov 2019

👍1

I managed to write a custom wrapper in front of Marked; it can probably be improved.

I decided to use the following format:

An <id> is of format a-zA-Z0-9_-
A reference link is of format [^<id>]
A reference block is of format ^<id>: <text>
A reference block cannot be multi-line

I originally wanted to be able to handle multi-line blocks, but that would require treating a paragraph as a single reference block, i.e. treating two successive reference-block lines as a single line.

Basically, to say:

^ref: Test test
^ref2: Another test

=>
^ref: "Test test ^ref2: Another test"

Do you think it makes sense?

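// Keep a handle on the original library; the wrapper below reuses the name `marked`.
const _marked = require('marked');

function marked(text) {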
    // Reference definition
    const refblockRe = /^\^([\w\-]+): (.+)$/gm;
    // Reference link
    const reflinkRe = /\[\^([\w\-]+)\]/g;
    // Defined references
    const refs = [];
    // New token list (stripped of reference blocks)
    const editedToks = [];

    // Lexing to remove paragraph-level blocks
    const toks = _marked.lexer(text);
    editedToks.links = toks.links;
    for (const tok of toks) {
        if (tok.type !== 'paragraph'
            || !tok.text.match(refblockRe)) {
            editedToks.push(tok);
            continue;
        }

        let matches;
        while ((matches = refblockRe.exec(tok.text)) !== null) {
            refs.push({
                id: refs.length + 1,
                selector: matches[1],
                paragraph: _marked(matches[2]),
            });
        }
    }

    let parsedHtml = _marked.parser(editedToks);

    const errors = {
        refToUndefinedSelector: [], // Reference link to undefined block
        unusedSelector: [], // Reference block that is never linked
    };
    // Every block that is defined, then used.
    const usedSelectors = [];
    // Every reference link that should be transformed or removed in the HTML
    const reflinkTransformations = [];
    // Parse and replace reflinks
    let match;
    while ((match = reflinkRe.exec(parsedHtml)) !== null) {
        const selector = match[1];
        const ref = refs.find(ref => ref.selector === selector);

        // If reference to undefined selector
        // (no blockref for corresponding selector)
        if (!ref) {
            errors.refToUndefinedSelector.push(selector);
            reflinkTransformations.push({
                mode: 'delete',
                startIndex: match.index,
                length: match[0].length, // length of the matched reflink, not of the whole input
            });

            continue;
        }

        usedSelectors.push(selector);
        reflinkTransformations.push({
            mode: 'replace',
            startIndex: match.index,
            length: match[0].length,
            id: ref.id,
            selector: ref.selector,
        });
    }

    // Check for unused selectors (defined blocks that are never linked)
    errors.unusedSelector = refs
        .filter(ref => !usedSelectors.includes(ref.selector));

    // Apply transformations from the last match to the first so earlier indexes stay valid
    for (const transformation of reflinkTransformations.sort((a, b) => b.startIndex - a.startIndex)) {
        const {id, selector} = transformation;
        const replacementValue = transformation.mode === 'replace'
            ? `<sup id="backref:${selector}"><a href="#ref:${selector}">${id}</a></sup>`
            : '';
        const before = parsedHtml.slice(0, transformation.startIndex);
        const after = parsedHtml.slice(transformation.startIndex + transformation.length, parsedHtml.length);
        parsedHtml = before + replacementValue + after;
    }

    return {
        references: refs,
        html: parsedHtml,
        errors,
    };
}

Edit: I don't know if it is possible, but it'd be nice to be able to somehow "inject" custom routines, much like middlewares, into the compiler, to simplify extension.

Arteneko on 5 Nov 2019

It looks like there is a spec for footnotes at markdownguide.org that uses [^label] syntax.

It looks like you have the right idea with replacing paragraph tokens. I would also run the body of the footnotes through marked so you can use markdown in the footnotes.

it'd be nice to be able to somehow "inject" custom routines, much like middlewares, into the compiler, to simplify extension.

We have talked about adding some sort of marked.use(marked-extension) method to allow extensions to hook into the process but there isn't a PR to implement that yet.

If you want to create a PR I would be happy to review it. 😁 👍

UziTech on 5 Nov 2019

The issue with the markdownguide.org spec is that it's interpreted as a link, something I didn't like.

I would also run the body of the footnotes through marked so you can use markdown in the footnotes.

That is already done, see:

            refs.push({
                id: refs.length + 1,
                selector: matches[1],
                paragraph: _marked(matches[2]),
            });

If you want to create a PR I would be happy to review it.

I may look into that once I have a bit more time to myself.

Would generators (two yields) or passing the lexer / parser work for the extension? (This should be discussed in another thread.)

Arteneko on 5 Nov 2019

👍1

I tried to convert my code to instead use the [^id]: paragraph syntax as recommended by the markdownguide (which also was how I originally saw it).

That would mean that the lexer would trust those blocks as links.

Except that the current lexer implementation doesn't parse multi-word links (logical), so I'd need to internally change the lexer for that purpose.

Such a change would probably mean I'd add a references property to the lexer array, but that makes me think that something is badly architected: we have an array with some plugged-in properties that have nothing to do with the array itself.

IMHO, some breaking changes should ultimately be done:

The lexer should return an object comprised of the tokens list, the links list, and the optionally new reference list. Even without this new feature, this format would allow for easier and much cleaner extension.
The parser should take an object at least comprised of the tokens list, and the links list (basically, what is required).
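As a rough sketch of the proposed shape (an illustration only, not marked's API): `toLexResult` is a hypothetical wrapper, and the `references` field is the suggested addition that marked does not provide. The input below is a hand-built stand-in for a lexer result.

```javascript
// Hypothetical wrapper: turn the array-with-extra-properties shape into a
// plain object. `references` is the proposed new field, not a marked feature.
function toLexResult(tokens) {
  return {
    tokens: Array.from(tokens),            // copies the tokens, leaving extra properties behind
    links: tokens.links || {},             // link definitions collected by the lexer
    references: tokens.references || [],   // assumed footnote list, empty if absent
  };
}

// Stand-in for a lexer result: an array carrying a `links` property.
const lexed = [{ type: 'paragraph', text: 'Hello' }];
lexed.links = { example: { href: 'https://example.com', title: null } };

const result = toLexResult(lexed);
console.log(Object.keys(result)); // [ 'tokens', 'links', 'references' ]
```

With such an object, an extension could add or read `references` without attaching more ad hoc properties to the token array.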

For now, I'm staying with my single-line ^id: paragraph style, which is easier to handle overall, but I definitely think that something needs to change here, even if it isn't the data structure itself.

Arteneko on 7 Nov 2019

Here is an implementation of footnotes that follows the spec.

marked.lexer breaks the markdown into block tokens (paragraphs, code blocks, etc) so it won't change anything inside a bracket to a link until it goes to the parser.

This code removes the footnotes from the block tokens, including multi-line footnotes, and changes the references to html before parsing the tokens. After parsing it adds the footnotes back to the html.

This code is in no way complete. There are probably edge cases that will fail but this should be a good start.

const marked = require('marked');

const markdown = `
Here's a simple footnote,[^1] and here's a longer one.[^bignote]

[^1]: This is the first footnote.

[^bignote]: Here's one with multiple paragraphs and code.

    Indent paragraphs to include them in the footnote.

    \`{ my code }\`

    Add as many paragraphs as you like.
`;

const footnotes = [];
const newTokens = [];
const footnoteTest = /^\[\^[^\]]+\]: /;
const footnoteMatch = /^\[\^([^\]]+)\]: ([\s\S]*)$/;
const referenceTest = /\[\^([^\]]+)\](?!\()/g;

// get block tokens
const tokens = marked.lexer(markdown);

// remove footnotes from tokens
for (let i = 0; i < tokens.length; i++) {
  if (tokens[i].type !== 'paragraph' || !footnoteTest.test(tokens[i].text)) {
    newTokens.push(tokens[i]);
    continue;
  }

  const match = tokens[i].text.match(footnoteMatch);
  const name = match[1].replace(/\W/g, '-');
  let note = match[2];

  // multiline notes will be considered indented code blocks
  if (i + 2 < tokens.length && tokens[i + 2].type === 'code' && tokens[i + 2].codeBlockStyle === 'indented') {
    note += '\n\n' + tokens[i + 2].text;
    i += 2;
  }

  footnotes.push({
    name,
    note: `${marked(note)} <a href="#fnref:${name}">↩</a>`
  });
}

// change references to superscript links
for (let i = 0; i < newTokens.length; i++) {
  if (newTokens[i].type === 'paragraph' || newTokens[i].type === 'text') {
    newTokens[i].text = newTokens[i].text.replace(referenceTest, (ref, value) => {
      const name = value.replace(/\W/g, '-');
      let code = ref;
      for (let j = 0; j < footnotes.length; j++) {
        if (footnotes[j].name === name) {
          code = `<sup id="fnref:${name}"><a href="#fn:${name}">${j + 1}</a></sup>`;
          break;
        }
      }
      return code;
    });
  }
}

newTokens.links = tokens.links;

let html = marked.parser(newTokens);

// add footnotes back to html
if (footnotes.length > 0) {
  html += `
<hr />
<ol>
  ${footnotes.map(f => `<li id="fn:${f.name}">${f.note}</li>`).join('\n  ')}
</ol>
`;
}

console.log(html);

UziTech on 7 Nov 2019

👍4

Hi @UziTech, thanks for providing this solution, I just tried and it doesn't work for me using the latest version.

Is that possibly related to the token changes?

cyanzhong on 22 May 2020

Yes, the tokens returned by marked.lexer changed in v1.0.0 so they are in a tree instead of in an array. You can use marked.walkTokens instead of the for loop to iterate over all of the tokens.

UziTech on 22 May 2020

❤1

@cyanzhong @UziTech I updated that code to work with the newer token structure.

const marked = require('marked');

const markdown = `
Here's a simple footnote,[^1] and here's a longer one.[^bignote]

[^1]: This is the first footnote.

[^bignote]: Here's one with multiple paragraphs and code.
    \`my code\`
    Indent paragraphs to include them in the footnote.
    Add as many paragraphs as you like.
`;

const footnotes = [];
const newTokens = [];
const footnoteTest = /^\[\^[^\]]+\]: /;
const footnoteMatch = /^\[\^([^\]]+)\]: ([\s\S]*)$/;
const referenceTest = /\[\^([^\]]+)\](?!\()/g;

// get block tokens
const tokens = marked.lexer(markdown);

// Check footnote
function checkFootnote (token) {
    if (token.type !== 'paragraph' || !footnoteTest.test(token.text)) {
      return;
    }

    const match = token.text.match(footnoteMatch);
    const name = match[1].replace(/\W/g, '-');
    let note = match[2];

    footnotes.push({
        name,
        note: `${marked(note)} <a href="#fnref:${name}">↩</a>`
    });

    // remove footnotes from tokens
    token.toDelete = true;

};

function checkReference(token)
{
    if( token.type === 'paragraph' || token.type === 'text' )
    {
        token.text = token.text.replace(referenceTest, (ref, value) => {
            const name = value.replace(/\W/g, '-');
            let code = ref;

            for (let j = 0; j < footnotes.length; j++) {
                if (footnotes[j].name === name) {
                    code = `<sup id="fnref:${name}"><a href="#fn:${name}">${j + 1}</a></sup>`;
                    break;
                }
            }
            return code;
        });

        if( token.type === 'paragraph')
        {
            // Override children
            token.tokens = marked.lexer(token.text)[0].tokens;
        }
    }
}

function visit (tokens, fn)
{
    for( var token of tokens )
    {
        fn( token );
        // Visit children
        if( token.tokens )
        {
            visit( token.tokens, fn)
        }
    }
}

visit( tokens, (token) => { checkFootnote(token); });


// Remove tokens from AST, starting with top-level
let workList = [ tokens ];
do {
    let tokenList = workList.pop();

    for(var i = tokenList.length-1; i >= 0 ; i--){
        if(tokenList[i].toDelete){
            tokenList.splice(i, 1);
        }
        else if( tokenList[i].tokens )
        {
            workList.push( tokenList[i].tokens );
        }
    }

} while( workList.length != 0 )


visit( tokens, (token) => { checkReference(token); });


let html = marked.parser(tokens);

if (footnotes.length > 0) 
{
  html += `
  <hr />
  <ol>
    ${footnotes.map(f => `<li id="fn:${f.name}">${f.note}</li>`).join('\n    ')}
  </ol>
  `;
}


console.log(html);

This is the output:

```html

Here's a simple footnote,¹ and here's a longer one.²

This is the first footnote.

↩

Here's one with multiple paragraphs and code.
my code
Indent paragraphs to include them in the footnote.
Add as many paragraphs as you like.

↩

```

chrisparnin on 12 Jun 2020

❤1

Here is some faster code (~30% more ops/sec) that uses Marked's renderer option. Place it after importing marked and before parsing any markdown.

Warning: Unlike the above code, this does not guarantee that a footnote exists for each reference.

const footnoteMatch = /^\[\^([^\]]+)\]:([\s\S]*)$/;
const referenceMatch = /\[\^([^\]]+)\](?!\()/g;
const referencePrefix = "marked-fnref";
const footnotePrefix = "marked-fn";
const footnoteTemplate = (ref, text) => {
  return `<sup id="${footnotePrefix}:${ref}">${ref}</sup>${text}`;
};
const referenceTemplate = ref => {
  return `<sup id="${referencePrefix}:${ref}"><a href="#${footnotePrefix}:${ref}">${ref}</a></sup>`;
};
const interpolateReferences = (text) => {
  return text.replace(referenceMatch, (_, ref) => {
    return referenceTemplate(ref);
  });
}
const interpolateFootnotes = (text) => {
  return text.replace(footnoteMatch, (_, value, text) => {
    return footnoteTemplate(value, text);
  });
}
const renderer = {
  paragraph(text) {
    return marked.Renderer.prototype.paragraph.apply(null, [
      interpolateReferences(interpolateFootnotes(text))
    ]);
  },
  text(text) {
    return marked.Renderer.prototype.text.apply(null, [
      interpolateReferences(interpolateFootnotes(text))
    ]);
  }
};
marked.use({ renderer });

If you want to parse footnotes in other locations, use the following template and place it in the renderer object.

  [token_type](text) {
    return marked.Renderer.prototype[token_type].apply(null, [
      interpolateReferences(interpolateFootnotes(text))
    ]);
  }
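Because the renderer approach does not check that each reference has a footnote, a separate validation pass over the raw markdown is one way to catch the gap. The following is a minimal sketch under that assumption: `findDanglingReferences` is a hypothetical helper, it only recognizes single-line `[^id]: text` definitions, and it is independent of marked.

```javascript
// Hypothetical helper: list footnote ids that are referenced but never defined.
function findDanglingReferences(markdown) {
  const definitionRe = /^\[\^([^\]]+)\]:.*$/gm; // "[^id]: text" at the start of a line
  const referenceRe = /\[\^([^\]]+)\]/g;        // "[^id]" anywhere else

  const defined = new Set();
  // Collect the defined ids and blank out the definition lines in one pass,
  // so definitions are not counted as references below.
  const withoutDefinitions = markdown.replace(definitionRe, (_, id) => {
    defined.add(id);
    return '';
  });

  const dangling = new Set();
  for (const match of withoutDefinitions.matchAll(referenceRe)) {
    if (!defined.has(match[1])) dangling.add(match[1]);
  }
  return [...dangling];
}

const missing = findDanglingReferences(
  'First[^a] and second[^b].\n\n[^a]: Defined footnote.'
);
console.log(missing); // [ 'b' ]
```

Running this before rendering lets the caller warn about, or strip, references that would otherwise link to a missing anchor.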

jun-sheaf on 12 Jun 2020

❤5

@jun-sheaf Thanks for your solution! It works great!
I found out that the footnote text has to contain a space; a single-word footnote doesn't work. I don't have a clue why.
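One possible explanation (an assumption, not confirmed in this thread): a definition whose body is a single word has the same shape as a Markdown link reference definition, `[label]: destination`, so it may be consumed as a link definition before the paragraph renderer ever sees it. The regular expression below is a simplified approximation for illustration only, not marked's actual rule.

```javascript
// Simplified approximation of a link reference definition: "[label]: destination".
// NOT marked's real grammar; it only illustrates the overlap in syntax.
const linkDefinitionLike = /^\[([^\]]+)\]:\s*(\S+)\s*$/;

// A single-word footnote looks exactly like "[label]: destination".
console.log(linkDefinitionLike.test('[^1]: word'));      // true

// With a space in the body, the line no longer fits that shape.
console.log(linkDefinitionLike.test('[^1]: two words')); // false
```

If that is the cause, extracting footnote definitions from the raw markdown before calling marked would avoid the collision.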

Reedo0910 on 24 Jul 2020

Thanks a lot @jun-sheaf, I can also confirm this works like a charm. Here's a version that additionally wraps the footnotes at the bottom in a "References" section (styleable with the CSS class "marked-footnotes", see "footnoteContainerTemplate" below), as I'm used to other markdown implementations doing this.

const footnoteMatch = /^\[\^([^\]]+)\]:([\s\S]*)$/;
const referenceMatch = /\[\^([^\]]+)\](?!\()/g;
const referencePrefix = "marked-fnref";
const footnotePrefix = "marked-fn";
const footnoteTemplate = (ref, text) => {
  return `<sup id="${footnotePrefix}:${ref}">${ref}</sup>${text}`;
};
const footnoteContainerTemplate = (text) => {
  return `<div class="marked-footnotes"><h2>References</h2>${text}</div>`
}
const referenceTemplate = ref => {
  return `<sup id="${referencePrefix}:${ref}"><a href="#${footnotePrefix}:${ref}">${ref}</a></sup>`;
};
const interpolateReferences = (text) => {
  return text.replace(referenceMatch, (_, ref) => {
    return referenceTemplate(ref);
  });
}
const interpolateFootnotes = (text) => {
  const found = text.match(footnoteMatch)
  if (found) {
    const replacedText = text.replace(footnoteMatch, (_, value, text) => {
        return footnoteTemplate(value, text);
    });
    return footnoteContainerTemplate(replacedText)
  }
  return text
}

const renderer = {
  paragraph(text) {
    return marked.Renderer.prototype.paragraph.apply(null, [
      interpolateReferences(interpolateFootnotes(text))
    ]);
  },
  text(text) {
    return marked.Renderer.prototype.text.apply(null, [
      interpolateReferences(interpolateFootnotes(text))
    ]);
  }
};
marked.use({ renderer });