Consider, say, a text like this:
2 125 euro
I've put a non-breaking space between 2 and 125 so that it would always end up on the same line.
Marked pre-parses the text and completely removes the original non-breaking characters that I've put there:
Lexer.prototype.lex = function(src) {
  src = src
    .replace(/\r\n|\r/g, '\n')
    .replace(/\t/g, '    ')
    .replace(/\u00a0/g, ' ')
    .replace(/\u2424/g, '\n');

  return this.token(src, true);
};
This is where the devil hides: .replace(/\u00a0/g, ' ')
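To see the effect in isolation, here is a minimal sketch that runs only that one replacement on the sample text. The name `stripNbsp` is made up for this illustration and is not part of marked's API.

```javascript
// Hypothetical helper: the single replacement under discussion, nothing else.
const stripNbsp = (src) => src.replace(/\u00a0/g, ' ');

const input = '2\u00a0125 euro'; // U+00A0 between "2" and "125"
const output = stripNbsp(input);

console.log(input.includes('\u00a0'));  // true
console.log(output.includes('\u00a0')); // false
console.log(output === '2 125 euro');   // true: an ordinary, breakable space
```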
Here is more on why invisible non-breaking space characters are cool: http://destroytoday.com/findings/fix-widows-with-non-breaking-spaces/
+1, if anything this should be an option, or configurable
Yes! I've recently lost a few hours of my life tracking this very same thing down.
@daleconboy, I'm sorry to hear that, but many people lost several hours of their lives trying to figure out why their spaces weren't getting processed correctly when text was passed in from the DOM (see #52 - cc @OscarGodson), which is why this was added in the first place.
I'll consider adding an option, but I want to keep their removal the default since more people probably get bit by this "feature" of contenteditable elements than not.
Hey, thanks for the response. I definitely sympathize with anyone who's been bitten by this quirk in any way, however I would argue against the default being wholesale replacement of non-breaking spaces.
Reason being, it's not a bug with marked, but rather a browser behavior which shouldn't be the responsibility of marked to manage. Technically the responsibility should fall on the developer who's using the contenteditable elements to be aware of the quirk and to manage the white space handling, or conversion, on their end.
The W3C working draft specifically calls this out to authors working with contenteditable elements:
http://www.w3.org/TR/html51/editing.html#best-practices-for-in-page-editors
Authors are encouraged to set the 'white-space' property on editing hosts and on markup that was originally created through these editing mechanisms to the value 'pre-wrap' …
It seems that with contenteditable regions expected to behave in this way, you would want to preserve their expected behavior by default to avoid confusion. This, in turn, would also avoid the confusion where devs are expecting their explicitly set non-breaking spaces to behave as expected.
And since marked may also be used in a Node environment, where contenteditable does not exist, having this replacement on by default would be unexpected there.
Bottom line, I appreciate you considering it as an option. How you decide to set the default behavior is of course up to you. Any option is definitely better than no option. I'll cast my vote for the default being no replacement. :)
Cheers!
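If the normalization were moved to the application side, as argued above, it could be as small as the sketch below. `normalizeEditableText` is a hypothetical helper, and the sketch assumes the editor's content has already been read into a string; it is one possible approach, not something marked provides.

```javascript
// Hypothetical helper: run this on text taken from a contenteditable element
// before handing it to the Markdown parser.
function normalizeEditableText(text) {
  // Turn the non-breaking spaces the editor produced back into ordinary spaces.
  return text.replace(/\u00a0/g, ' ');
}

// Sample input as it might come out of an editor (assumed for illustration).
const fromEditor = '*\u00a0item one\n*\u00a0item two';
const markdown = normalizeEditableText(fromEditor);

console.log(markdown.includes('\u00a0')); // false
```

Text that was written by hand, with deliberate non-breaking spaces, would simply skip this step.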
@daleconboy's argument is pretty convincing. Are there other use cases for no-break spaces in markdown input? I would think a set of tests would help define the severity of the issue.
@daleconboy I like your point about the browser, except that in @arturi's post he specifically points out that these spaces are good for fixing a browser bug, haha :) Also, I wouldn't agree that it's a browser issue. Markdown's "spec" doesn't say which kinds of spaces are and aren't allowed, so IMO marked, and any Markdown parser, should treat all spaces (nbsp, Unicode, etc.) as what they are: spaces. Your suggestion, unless I'm misunderstanding it, is to specifically _ignore_ certain kinds of spaces.
I’m working on a Markdown-based presentation tool, and I’m using marked to generate HTML.
Having control over when and where text wraps is vital in a good presentation. Currently, the only way I can do that with marked is by overriding the lexer with a custom one that does the same things as the original one, except for the NBSP replacement. This is of course far from future-proof: In case the original lexer changes, I have to adapt my code.
Therefore I’m very much in favor of making this configurable. If you’re interested in a PR, let us know. And although I think that _not_ replacing the NBSPs is the “right” thing to do, I can understand that you don’t want to break existing code that relies on marked fixing the browser behavior. So, I don’t care what the default for this option is, but please introduce one.
@scy's suggestion is nice. Let me extend it with an example; it might be helpful for future readers.
import marked from 'marked';
// monkey patch for marked 0.3.3 to preserve non-breaking spaces
marked.Lexer.prototype.lex = function (src) {
  src = src
    .replace(/\r\n|\r/g, '\n')
    .replace(/\t/g, '    ')
    .replace(/\u2424/g, '\n');

  return this.token(src, true);
};
UPDATED 2017-05-11: fixed syntax
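The patch above copies the lexer's pre-processing, so it has to track any change in marked. A less coupled alternative, sketched here under stated assumptions, is to hide the non-breaking spaces behind a placeholder before rendering and restore them afterwards. `preserveNbsp` and `fakeRender` are hypothetical names, the sketch assumes U+E000 never occurs in the input, and it has not been tested against marked itself.

```javascript
// Assumption: U+E000 (a private-use code point) never occurs in the input.
const PLACEHOLDER = '\uE000';

// Wraps any "Markdown string -> HTML string" function so that U+00A0 survives it.
function preserveNbsp(render) {
  return function (src) {
    const shielded = src.replace(/\u00a0/g, PLACEHOLDER);
    const html = render(shielded);
    return html.replace(/\uE000/g, '\u00a0');
  };
}

// Stand-in renderer for this demo only: it strips U+00A0 the way the lexer's
// pre-processing does and wraps the text in a paragraph.
function fakeRender(src) {
  return '<p>' + src.replace(/\u00a0/g, ' ') + '</p>';
}

const render = preserveNbsp(fakeRender);
const html = render('2\u00a0125 euro');

console.log(html === '<p>2\u00a0125 euro</p>'); // true
```

One consequence of this design is that the parser never sees the placeholder as whitespace, so a non-breaking space would no longer act as a syntax space (after a list marker, for instance).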
Is anyone aware of an option or a work around for this issue?
There should definitely be an option to allow non-breaking spaces to pass to the HTML.
I’ve solved it by extending lex, as shown here: https://github.com/chjj/marked/issues/363#issuecomment-112853706.
Hello everyone,
@Lendar 's suggestion is wonderful and if I edit this in the source code I can fix it this way. However putting it straight into my own code complains about 'this.token' not being a function. What is the best way to implement this?
Cheers!
I imagine the problem, @deanvaessen, is the arrow function. Try replacing (src) => { ... } with function(src) { ... }. :)
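For future readers, the difference can be reproduced without marked at all. The tiny `Lexer` below is a stand-in written for this illustration, not marked's real class.

```javascript
// Stand-in class for illustration only.
function Lexer() {}
Lexer.prototype.token = function (src) {
  return 'tokens for: ' + src;
};

// Regular function: `this` is the Lexer instance the method is called on.
Lexer.prototype.lexWithFunction = function (src) {
  return this.token(src);
};

// Arrow function: `this` is taken from the scope where the arrow was defined,
// not from the instance, so `this.token` is not the Lexer's method.
Lexer.prototype.lexWithArrow = (src) => this.token(src);

const lexer = new Lexer();
const ok = lexer.lexWithFunction('a');

let arrowError = null;
try {
  lexer.lexWithArrow('a');
} catch (err) {
  arrowError = err; // a TypeError, since `this.token` cannot be called
}

console.log(ok); // tokens for: a
console.log(arrowError !== null); // true
```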
Confirmed. Thank you @davidchambers :)
@davidchambers @deanvaessen surprised it's still relevant. Updated the example in the comment ⬆️
I am in the same boat as @scy — working on a Markdown-based presentation tool. I, too, want to control where lines break and where they never break. Please make an option that stops breaking non-breaking spaces.
As for browsers and/or WYSIWYG editors inserting non-breaking spaces where not explicitly requested by the user, those are their bugs and should be fixed there.
Up ?
Sometimes non-breaking spaces are required by the language. In French, for example (https://fr.wikipedia.org/wiki/Espace_ins%C3%A9cable), they are necessary before '?', ':', '!' and ';', in thousand separators, and in phone numbers; I also use them between quotes and wherever a line break should be avoided, such as in brand names.
Thus, there is no reason for removing them (i would say non-breaking spaces should not be interpreted as syntax spaces).
PS: there is also no reason for anyone to monkey-patch marked. But it's a bit annoying to always work with minor-fix forks.
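As an illustration of the kind of input described above, here is a rough sketch that puts a non-breaking space before French high punctuation. `frenchNbsp` is a hypothetical helper that covers only the simplest case, where an ordinary space already precedes the mark; it is not a complete typography tool.

```javascript
// Hypothetical helper: turn an ordinary space before ? ! : ; into U+00A0.
function frenchNbsp(text) {
  return text.replace(/ ([?!:;])/g, '\u00a0$1');
}

const sentence = frenchNbsp('Comment ça va ?');

console.log(sentence === 'Comment ça va\u00a0?'); // true
```

Text prepared this way is exactly what the replacement discussed in this thread would undo before parsing.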
Alright, let's use @Lendar's monkey patch :) to replace
https://github.com/chjj/marked/blob/6b0416d10910702f73da9cb6bb3d4c8dcb7dead7/lib/marked.js#L142-L150
Closing as having a fix or workaround as the Marked library proper figures its life out. :)
@joshbruce So it's a won't fix.
@oliviertassinari: At this juncture I'm siding with @chjj on this one (https://github.com/chjj/marked/issues/363#issuecomment-37497732). See #956 as well:
I guess what I'm saying is, right now we have bigger fish to fry and it seems like there is a viable workaround in the meantime. Does that help?
Note: This only applies to explicit inclusion in the Markdown.
@joshbruce Thanks for the extra details. I wasn't sure what was the implication of the first answer.
@oliviertassinari: Fair. And sorry for not providing more - was in a rush going through issues. :)
What Christopher wrote about people getting bit by this is at least debatable.
I also came across an example in commonmark where a non-breaking space changed the interpretation behaviour because it usually isn't allowed to be used in place of a single space. So if we want to comply I think we need to at least consider this.
I never used one but I guess some people use it so I see no point in replacing them altogether with single spaces.
However, if we merge this we must require it to be tested properly.
@Feder1co5oave: Reopen or no?
Again, I'm not sure whether the original ticket was referring to the HTML-encoded &nbsp; or to a literal Unicode character; those are not the same thing in my book. As a user, I would expect the &nbsp; to be preserved, but not necessarily a special character injected as UTF-8 or something similar... am I wrong there? I concur that Chris's assessment is debatable.
Leave it to you, brother.
I'm pretty sure &nbsp;s pass through without a problem. Whereas the
Unicode character is currently replaced by a single space. It seems it was
set this way because users somehow typed in unwanted non-breaking spaces,
but it seems to me this assumption is flimsy. Also you take away from
others the possibility to consciously use non-breaking spaces and I don't
like that. We certainly need to improve our Unicode support in general (per
commonmark), so I think this will change eventually. We need to make sure
everything works smoothly as usual.
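The distinction drawn above can be checked against the replacement itself. This sketch only exercises the one regular expression from the lexer's pre-processing (via a hypothetical `stripNbspChar` helper); it says nothing about what the rest of marked does with the entity.

```javascript
// Hypothetical helper mirroring the single replacement under discussion.
const stripNbspChar = (src) => src.replace(/\u00a0/g, ' ');

const withEntity = 'Marked&nbsp;rocks';    // six ASCII characters: & n b s p ;
const withChar = 'Marked\u00a0rocks';      // the literal U+00A0 character

console.log(stripNbspChar(withEntity) === withEntity);   // true: entity text untouched
console.log(stripNbspChar(withChar) === 'Marked rocks'); // true: character replaced
```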
All right. Leaving closed for now. Flagging with newly minted #1048 for when we're ready to focus there. This could also explain the Chinese character problems with header ids, yeah?
Yes it's related to that and headings' ids
Tagging as #1048
Tagging #1043 as well, just because of the "header ids" comment.
related pr #897