Marked: Marked removes non-breaking spaces in the original text

Created on 9 Mar 2014 · 31 Comments · Source: markedjs/marked

Consider, say, a text like this:
2 125 euro
I've put a non-breaking space between 2 and 125 so that it would always end up on the same line.

Marked pre-parses the text and completely removes the original non-breaking characters that I've put there:

Lexer.prototype.lex = function(src) {
  src = src
    .replace(/\r\n|\r/g, '\n')
    .replace(/\t/g, '    ')
    .replace(/\u00a0/g, ' ')
    .replace(/\u2424/g, '\n');
  return this.token(src, true);
};

This is where the devil hides: .replace(/\u00a0/g, ' ')
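
To make the effect concrete, here is a minimal self-contained sketch. It copies only the four replace calls quoted above into two stand-in functions; the function names are made up for illustration and are not part of marked.

```javascript
// Illustrative stand-ins; these function names are NOT part of marked.
// Mirrors the four replace calls quoted from Lexer.prototype.lex above.
function preprocessWithNbspReplacement(src) {
  return src
    .replace(/\r\n|\r/g, '\n')   // normalize line endings
    .replace(/\t/g, '    ')      // expand tabs to four spaces
    .replace(/\u00a0/g, ' ')     // U+00A0 (non-breaking space) becomes a plain space
    .replace(/\u2424/g, '\n');   // U+2424 (symbol for newline) becomes a newline
}

// Same steps minus the U+00A0 replacement.
function preprocessKeepingNbsp(src) {
  return src
    .replace(/\r\n|\r/g, '\n')
    .replace(/\t/g, '    ')
    .replace(/\u2424/g, '\n');
}

const input = '2\u00a0125 euro';
console.log(preprocessWithNbspReplacement(input).includes('\u00a0')); // false
console.log(preprocessKeepingNbsp(input).includes('\u00a0'));         // true
```

With the first variant the text can wrap between 2 and 125; with the second the non-breaking space survives preprocessing.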

Here is more on why invisible non-breaking space characters are cool: http://destroytoday.com/findings/fix-widows-with-non-breaking-spaces/

has PR proposal

arturi

👍3

Most helpful comment

I’m working on a Markdown-based presentation tool, and I’m using marked to generate HTML.

Having control over when and where text wraps is vital in a good presentation. Currently, the only way I can do that with marked is by overriding the lexer with a custom one that does the same things as the original one, except for the NBSP replacement. This is of course far from future-proof: In case the original lexer changes, I have to adapt my code.

Therefore I’m very much in favor of making this configurable. If you’re interested in a PR, let us know. And although I think that _not_ replacing the NBSPs is the “right” thing to do, I can understand that you don’t want to break existing code that relies on marked fixing the browser behavior. So, I don’t care what the default for this option is, but please introduce one.

scy on 8 Apr 2014

👍2

All 31 comments

+1, if anything this should be an option, or configurable

christopherscott on 13 Mar 2014

Yes! I've recently lost a few hours of my life tracking this very same thing down.

daleconboy on 13 Mar 2014

@daleconboy, I'm sorry to hear that, but many people lost several hours of their lives trying to figure out why their spaces weren't getting processed correctly when text was passed in from the DOM (see #52 - cc @OscarGodson), which is why this was added in the first place.

I'll consider adding an option, but I want to keep their removal the default since more people probably get bit by this "feature" of contenteditable elements than not.

chjj on 13 Mar 2014

Hey, thanks for the response. I definitely sympathize with anyone who's been bitten by this quirk in any way, however I would argue against the default being wholesale replacement of non-breaking spaces.

Reason being, it's not a bug with marked, but rather a browser behavior which shouldn't be the responsibility of marked to manage. Technically the responsibility should fall on the developer who's using the contenteditable elements to be aware of the quirk and to manage the white space handling, or conversion, on their end.

The W3C working draft specifically calls this out to authors working with contenteditable elements:

http://www.w3.org/TR/html51/editing.html#best-practices-for-in-page-editors

Authors are encouraged to set the 'white-space' property on editing hosts and on markup that was originally created through these editing mechanisms to the value 'pre-wrap' …

It seems that with contenteditable regions expected to behave in this way, you would want to preserve their expected behavior by default to avoid confusion. This, in turn, would also avoid the confusion where devs are expecting their explicitly set non-breaking spaces to behave as expected.

And, since marked may also be used in a node environment where contenteditable does not exist, this replacement behavior by default would be unexpected.

Bottom line, I appreciate you considering it as an option. How you decide to set the default behavior is of course up to you. Any option is definitely better than no option. I'll cast my vote for the default being no replacement. :)

Cheers!

daleconboy on 14 Mar 2014
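
A rough sketch of the caller-side conversion argued for above, assuming the application knows which input came from a contenteditable element. The helper name is hypothetical and is not part of marked or any browser API.

```javascript
// Hypothetical helper; the name and the idea of calling it before the
// Markdown parser are assumptions for illustration, not marked's API.
function normalizeEditableText(text) {
  // Browsers may insert U+00A0 into contenteditable regions;
  // convert those to plain spaces for this input source only.
  return text.replace(/\u00a0/g, ' ');
}

const fromEditor = 'Hello\u00a0 world'; // e.g. text read from an editing host
const fromFile = '2\u00a0125 euro';     // text where the author typed U+00A0 on purpose

console.log(normalizeEditableText(fromEditor) === 'Hello  world'); // true
console.log(fromFile.includes('\u00a0'));                          // true (left untouched)
```

The point of the design is that the conversion is applied per input source by the code that knows where the text came from, instead of unconditionally inside the parser.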

@daleconboy's argument is pretty convincing. Are there other use cases for no-break spaces in markdown input? I would think a set of tests would help define the severity of the issue.

drscannell on 14 Mar 2014

@daleconboy I like your point about the browser, except, in @arturi's post he specifically points out that spaces are good to fix a browser bug haha :) Also, I wouldn't agree that it's a browser issue. Markdown's "spec" doesn't say which kinds of spaces are and aren't allowed, so IMO Marked, and any Markdown parser, should assume all spaces (nbsp, unicode, etc.) should be considered what they are: spaces. Your suggestion, unless I'm misunderstanding it, is to specifically _ignore_ certain kinds of spaces.

OscarGodson on 14 Mar 2014

@scy's suggestion is nice. Let me extend it with an example. It might be helpful for future readers...

import marked from 'marked';

// monkey patch for marked 0.3.3 to preserve non-breaking spaces
marked.Lexer.prototype.lex = function (src) {
  src = src
    .replace(/\r\n|\r/g, '\n')
    .replace(/\t/g, '    ')
    .replace(/\u2424/g, '\n');

  return this.token(src, true);
};

UPDATED 2017-05-11: fixed syntax

Lendar on 17 Jun 2015

Is anyone aware of an option or a work around for this issue?

There should definitely be an option to allow non-breaking spaces to pass to the HTML.

RichardForrester on 15 Feb 2016

I’ve solved it by extending lex, as shown here: https://github.com/chjj/marked/issues/363#issuecomment-112853706.

arturi on 15 Feb 2016

Hello everyone,

@Lendar's suggestion is wonderful, and if I edit this in the source code I can fix it this way. However, when I put it straight into my own code, it complains that 'this.token' is not a function. What is the best way to implement this?

Cheers!

deanvaessen on 29 Apr 2017

I imagine the problem, @deanvaessen, is the arrow function. Try replacing (src) => { ... } with function(src) { ... }. :)

davidchambers on 29 Apr 2017
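
A small sketch of why the arrow function breaks the patch. FakeLexer is a hypothetical stand-in used only for illustration, not marked's real Lexer; the behavior shown is standard JavaScript `this` binding.

```javascript
// Hypothetical stand-in for a lexer class; NOT marked's real implementation.
function FakeLexer() {}
FakeLexer.prototype.token = function (src) {
  return ['token:' + src];
};

// Regular function: `this` is the instance the method is called on.
FakeLexer.prototype.lexWithFunction = function (src) {
  return this.token(src);
};

// Arrow function: `this` is taken from the enclosing scope, not the instance,
// so `this.token` is not the prototype method.
FakeLexer.prototype.lexWithArrow = (src) => {
  return this.token(src);
};

const lexer = new FakeLexer();
console.log(lexer.lexWithFunction('a')[0]); // token:a

try {
  lexer.lexWithArrow('a');
} catch (err) {
  console.log(err.name); // TypeError
}
```

Arrow functions never get their own `this`, so a method that needs the instance has to be written as a regular function when it is assigned to a prototype.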

❤1

Confirmed. Thank you @davidchambers :)

deanvaessen on 4 May 2017

@davidchambers @deanvaessen surprised it's still relevant. Updated the example in the comment ⬆️

Lendar on 11 May 2017

👍1

I am in the same boat as @scy — working on a Markdown-based presentation tool. I, too, want to control where lines break and where they never break. Please make an option that stops breaking non-breaking spaces.

As for browsers and/or WYSIWYG editors inserting non-breaking spaces where not explicitly requested by the user, those are their bugs and should be fixed there.

yurikhan on 28 May 2017

Up ?
Sometimes, non-breaking spaces are needed by the language (e.g. in French, https://fr.wikipedia.org/wiki/Espace_ins%C3%A9cable, non-breaking spaces are necessary before '?', ':', '!', ';', and in thousand separators and phone numbers, and I also use them between quotes and where line-breaking should be avoided, like brand names).

Thus, there is no reason for removing them (I would say non-breaking spaces should not be interpreted as syntax spaces).

PS: there is also no reason for anyone to monkey-patch marked. But it's a bit annoying to always work with minor-fix forks.

ArTiSTiX on 9 Oct 2017

Alright, let's use @Lendar's monkey patch :) to replace
https://github.com/chjj/marked/blob/6b0416d10910702f73da9cb6bb3d4c8dcb7dead7/lib/marked.js#L142-L150

oliviertassinari on 11 Feb 2018

Closing as having a fix or workaround as the Marked library proper figures its life out. :)

joshbruce on 11 Feb 2018

👎1

@joshbruce So it's a won't fix.

oliviertassinari on 11 Feb 2018

@oliviertassinari: At this juncture I'm siding with @chjj on this one (https://github.com/chjj/marked/issues/363#issuecomment-37497732). See #956 as well:

XSS fixes were the focus.
Fixing known issues and complying with CommonMark and GFM are next; so, it's on our radar (@Feder1co5oave and @UziTech) as the spec does see it as a part of mixed content: http://spec.commonmark.org/0.28/#example-302 - see also #958 (if requested to be reopened by the primary contributors at this time, it will be)

I guess what I'm saying is, right now we have bigger fish to fry and it seems like there is a viable workaround in the meantime. Does that help?

joshbruce on 11 Feb 2018

Note: This only applies to explicit &nbsp; inclusion in the Markdown.

joshbruce on 11 Feb 2018

@joshbruce Thanks for the extra details. I wasn't sure what was the implication of the first answer.

oliviertassinari on 11 Feb 2018

👍1

@oliviertassinari: Fair. And sorry for not providing more - was in a rush going through issues. :)

joshbruce on 11 Feb 2018

What Christopher wrote about people getting bit by this is at least debatable.
I also came across an example in commonmark where a non-breaking space changed the interpretation behaviour because it usually isn't allowed to be used in place of a single space. So if we want to comply I think we need to at least consider this.
I never used one but I guess some people use it so I see no point in replacing them altogether with single spaces.
However, if we merge this we must require it to be tested properly.

Feder1co5oave on 11 Feb 2018

@Feder1co5oave: Reopen or no?

Again, I'm not sure if the original ticket was referring to the HTML-encoded &nbsp; or a Unicode character discovery - not the same thing in my book. As a user, I would expect the &nbsp; to be preserved, but not necessarily a special character injection of UTF-8 or something similar... am I wrong there? I concur that Chris's assessment is debatable.

Leave it to you, brother.

joshbruce on 11 Feb 2018

I'm pretty sure &nbsp;s pass through without a problem, whereas the Unicode character is currently replaced by a single space. It seems it was set this way because users somehow typed in unwanted non-breaking spaces, but it seems to me this assumption is flimsy. Also, you take away from others the possibility to consciously use non-breaking spaces, and I don't like that. We certainly need to improve our Unicode support in general (per commonmark), so I think this will change eventually. We need to make sure everything works smoothly as usual.

Feder1co5oave on 11 Feb 2018

All right. Leaving closed for now. Flagging with newly minted #1048 for when we're ready to focus there. This could also explain the Chinese character problems with header ids, yeah?

joshbruce on 11 Feb 2018

Yes it's related to that and headings' ids

Feder1co5oave on 11 Feb 2018

Tagging as #1048

Feder1co5oave on 23 Feb 2018

👍1

Tagging #1043 as well, just because of the "header ids" comment.

joshbruce on 24 Feb 2018

related pr #897