Refined-github: url regex is replacing & with &; breaking urls

Created on 13 Nov 2017  路  6Comments  路  Source: sindresorhus/refined-github

Similar to #803
I haven't looked at the code, but see this url in the review section:
https://github.com/saglacio/saglac.io/pull/143/files/e92bf806d0a0757da2b16623172fa3041e65db36#diff-aeb42283af8ef8e9da40ededd3ae2ab2R23

And the same url not being transformed to a js-linkable-line-number in the conversation:
https://github.com/saglacio/saglac.io/pull/143#discussion-diff-150555564L22

Original (screenshot):

Bug (screenshot, notice the missing amp at the end of the link):

Live test:

url: https://saglac.us17.list-manage.com/subscribe/post?u=c9f3c71f1370a7e550caa409b&id=f45e751233
bug help wanted

Most helpful comment

For a moment I thought this was an enhancement, but it really is a bug. Not only the URL is linkified incorrectly (oh well...), but the code is also changed (which should never happen, amp should not be missing)

This bug can be fixed here: https://github.com/sindresorhus/linkify-urls/issues/14

All 6 comments

This is gonna be interesting.

  • I don't think that generally ; is an accepted character (non-url-encoded) in URLs.
  • In this case the URL isn't even technically valid since it's escaped HTML so even if detected correctly it won't work unless we also unescape it.
  • We don't know if a URL is escaped and what kind of escaping is applied to it. Perhaps & is common enough to grant its own simple detection+unescaping though.

@bfred-it I can confirm this point:

I don't think that generally ; is an accepted character (non-url-encoded) in URLs.

; is not an accepted character in URLs without being escaped. We've checked RFCs in order to use it in some CSV imports containing multiple URLs with a separator. We are using ; as a separator instead of just comma , for example.

Not as easy as it seems, this is the closest I got:

- const linkified = fn(textNode.textContent, options);
+ const linkified = fn(unescape(textNode.textContent), Object.assign({
+   value: escape(textNode.textContent) 
+   // wrong, it should just be the URL, not the whole string. Related: https://github.com/sindresorhus/linkify-urls/issues/10
+ }, options));
  1. It should unescape the URL so it can be matched by linkify-urls
  2. It should preserve the original unescaped URL as was because this is code.

I can't think of a way to both unescape the URL so it can be parsed by linkify-urls and also preserve the original URL (how do you match it out of textNode.textContent?)

In short, to solve this problem:

  • linkify-urls should either accept a custom URL regex, or 馃槦
  • it should detect escaped URLs internally, or 馃槼
  • we can't use linkify-url. 馃槦

@sindresorhus?

Why are we still discussing this if we all agree it's an invalid URL?

Perhaps & is common enough to grant its own simple detection+unescaping though.

It's not a valid URL until you unescape it. It's common to find them in HTML documents.

For a moment I thought this was an enhancement, but it really is a bug. Not only the URL is linkified incorrectly (oh well...), but the code is also changed (which should never happen, amp should not be missing)

This bug can be fixed here: https://github.com/sindresorhus/linkify-urls/issues/14

Was this page helpful?
0 / 5 - 0 ratings

Related issues

hkdobrev picture hkdobrev  路  3Comments

durka picture durka  路  3Comments

alexanderadam picture alexanderadam  路  3Comments

fregante picture fregante  路  3Comments

supremebeing7 picture supremebeing7  路  3Comments