I added this URL, https://pgp.mit.edu/pks/lookup?op=get&search=0x18231B0B449CC9D2, to my bio, but the & has been replaced by &amp;, so the resulting link is wrong: https://pgp.mit.edu/pks/lookup?op=get&amp;search=0x18231B0B449CC9D2
You should know that this is the correct way to write a URL in HTML. Normally the browser understands it. On GitHub here it's & for the second URL.
So the bug applies to every URL that is sent: it double-encodes &, once for the text and once for the URL.
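To make the double encoding concrete, here is a minimal sketch using Ruby's standard `CGI.escapeHTML`. It is not Mastodon's code; it only assumes that two HTML-escaping passes run over the same string, which is what the symptom suggests.

```ruby
require 'cgi'

url = 'https://pgp.mit.edu/pks/lookup?op=get&search=0x18231B0B449CC9D2'

# One escaping pass: this is the form that belongs in an href attribute.
once = CGI.escapeHTML(url)

# A second pass over already-escaped text escapes the "&" of "&amp;" again.
twice = CGI.escapeHTML(once)

puts once
puts twice

# A browser unescapes an attribute value exactly once, so after double
# encoding it ends up with the single-encoded string, not the original URL.
seen_by_browser = CGI.unescapeHTML(twice)
puts seen_by_browser == url   # false
puts seen_by_browser == once  # true
```

The last two lines are the whole bug in miniature: the link the browser follows still contains a literal `&amp;`.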
@Exagone313 1/ Sorry, but I think that when a user copies and pastes a URL, they expect not to have to rewrite it.
2/ https://pgp.mit.edu/pks/lookup?op=get&amp;search=0x18231B0B449CC9D2 does not take me to the right URL... and that's normal.
Check the source code of this page for your links and you'll understand what I mean.
The double HTML entity encoding is the issue. In particular, here is where it happens: https://github.com/tootsuite/mastodon/blob/master/app/lib/formatter.rb#L38
However, I am hesitant to mess with that code because encoding HTML entities is required to prevent XSS attacks from users posting HTML.
You just have to use a regex to convert the URL back. Both the URL and the displayed text should be encoded once, so you just have to convert &amp;amp; back to &amp;, if I'm not wrong. After that, check for a < in the URL. The remaining problem should be malformed URLs containing something like &<.
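One reading of the regex suggestion above is to collapse a double-encoded ampersand back to a single-encoded one. A rough sketch follows; `collapse_double_amp` is a hypothetical helper name, not something that exists in Mastodon.

```ruby
# Hypothetical helper: undo one level of double encoding, for ampersands only.
# Other entities such as "&lt;" are deliberately left alone.
def collapse_double_amp(html)
  html.gsub('&amp;amp;', '&amp;')
end

href = 'https://pgp.mit.edu/pks/lookup?op=get&amp;amp;search'
puts collapse_double_amp(href)

# Caveat: a user who literally typed "&amp;" in plain text is correctly
# stored as "&amp;amp;" after one escaping pass. Collapsing it would change
# what they wrote, so this should only ever be applied to the URL part.
typed_by_user = 'write &amp;amp; to show an entity'
puts collapse_double_amp(typed_by_user)
```

Because of that caveat, this kind of after-the-fact repair is fragile; escaping each piece exactly once in the first place avoids the ambiguity.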
Where are non-local messages handled? It does not trust other instances about that, right?
Also, I think we should open a new issue about how to treat malformed URLs with special characters, because a URL has several possible parts: protocol, user, password, host (see the xn-- form, and also IPv4 and [IPv6]), port, path (where a space becomes %20), and query string (where a space becomes +). Then there is the way a URL is written in HTML, where & becomes &amp; (with special characters, the URL displayed in the <a> node may not be the link-encoded form!).
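For reference, the URL parts listed above can be inspected with Ruby's standard `URI` library. This is only an illustration with a made-up example.com URL, not a proposal for how the formatter should parse links.

```ruby
require 'uri'
require 'erb'
require 'cgi'

# Made-up URL that exercises most of the parts mentioned above.
uri = URI.parse('https://user:secret@example.com:8443/some/path?foo=bar&quux=true')

puts uri.scheme    # protocol
puts uri.user      # user
puts uri.password  # password
puts uri.host      # host
puts uri.port      # port
puts uri.path      # path
puts uri.query     # query string

# IPv6 hosts are written in brackets; #hostname strips them.
v6 = URI.parse('http://[::1]:3000/')
puts v6.host
puts v6.hostname

# A space is encoded differently depending on where it appears.
in_path  = ERB::Util.url_encode('a b')           # percent-encoding
in_query = URI.encode_www_form_component('a b')  # form encoding
puts in_path
puts in_query

# HTML escaping is a separate, later layer applied to the whole URL.
puts CGI.escapeHTML(uri.to_s)
```

The point is that URL encoding (per part) and HTML escaping (per attribute or text node) are two different layers, and each should be applied once.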
Here only https?:// URLs are allowed. The check could instead be blacklist-based (rejecting javascript:, for example), keeping in mind that some protocols such as magnet use the form protocol:addr rather than protocol://addr.
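A small sketch of the two policies, to show how they differ for something like magnet. The constant and method names are hypothetical, and the deny-list contents are only an example, not a vetted list.

```ruby
# Allow-list: roughly what an https?:// pattern amounts to.
ALLOWED_SCHEMES = %w[http https].freeze

# Deny-list: illustrative only; a real one would need careful review.
DENIED_SCHEMES = %w[javascript data vbscript].freeze

# Extract the scheme without requiring "//" after the colon,
# so that "protocol:addr" forms like magnet are recognised too.
def scheme_of(url)
  m = url.match(/\A([a-z][a-z0-9+.\-]*):/i)
  m && m[1].downcase
end

def allowed_by_allow_list?(url)
  ALLOWED_SCHEMES.include?(scheme_of(url))
end

def allowed_by_deny_list?(url)
  scheme = scheme_of(url)
  !scheme.nil? && !DENIED_SCHEMES.include?(scheme)
end

[
  'https://pgp.mit.edu/pks/lookup?op=get&search',
  'magnet:?xt=urn:btih:abc',
  'JavaScript:alert(1)'
].each do |url|
  puts "#{url} allow-list=#{allowed_by_allow_list?(url)} deny-list=#{allowed_by_deny_list?(url)}"
end
```

The allow-list is the safer default since anything unknown is rejected; the deny-list lets magnet through but depends on the list being complete.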
The scope could be larger than only bio.
An example of the ampersand problem is at https://mastodon.social/@envlh/1644777
This is a legitimate URL to Wikidata, a sister project of Wikipedia. Users expect to be able to use the ?foo=bar&quux=true query string syntax in URLs without having to encode those characters themselves.
The problem is the combination of encode and link_urls.
>> encoded = Formatter.instance.send :encode, 'https://pgp.mit.edu/pks/lookup?op=get&search'
=> "https://pgp.mit.edu/pks/lookup?op=get&amp;search"
>> Formatter.instance.send :link_urls, encoded
=> "<a href=\"https://pgp.mit.edu/pks/lookup?op=get&amp;amp;search\" rel=\"nofollow noopener\" target=\"_blank\"><span class=\"invisible\">https://</span><span class=\"ellipsis\">pgp.mit.edu/pks/lookup?op=get&</span><span class=\"invisible\">amp;amp;search</span></a>"
We can't simply remove encode because link_urls doesn't encode other HTML entities.
>> Formatter.instance.send :link_urls, ' a & b & <script>'
=> " a & b & <script>"
And actually link_urls isn't correctly encoding the display text of the <a> either:
>> Formatter.instance.send :link_urls, 'https://pgp.mit.edu/pks/lookup?op=get&search'
=> "<a href=\"https://pgp.mit.edu/pks/lookup?op=get&amp;search\" rel=\"nofollow noopener\" target=\"_blank\"><span class=\"invisible\">https://</span><span class=\"ellipsis\">pgp.mit.edu/pks/lookup?op=get&</span><span class=\"invisible\">amp;search</span></a>"
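For comparison, here is a minimal sketch of the "escape every piece exactly once" idea: walk the raw text, escape the non-URL segments, and escape each URL once for both the href and the display text. `link_once` and `URL_PATTERN` are hypothetical names, the URL pattern is heavily simplified, and this is not how Twitter::Autolink or the eventual fix works.

```ruby
require 'cgi'

# Simplified pattern (assumption): http(s) URLs up to whitespace, quotes or angle brackets.
URL_PATTERN = %r{https?://[^\s<>"]+}

def link_once(text)
  result = String.new
  pos = 0

  text.scan(URL_PATTERN) do
    match = Regexp.last_match

    # Plain text before the URL: escaped once.
    result << CGI.escapeHTML(text[pos...match.begin(0)])

    # The URL itself: escaped once, then reused for href and display text.
    escaped = CGI.escapeHTML(match[0])
    result << %(<a href="#{escaped}" rel="nofollow noopener" target="_blank">#{escaped}</a>)

    pos = match.end(0)
  end

  # Trailing plain text after the last URL.
  result << CGI.escapeHTML(text[pos..-1])
end

html = link_once('a & b <script> https://pgp.mit.edu/pks/lookup?op=get&search')
puts html
```

Because the input is never escaped as a whole before linking, no segment goes through the escaper twice, and the `<script>` in the surrounding text is still neutralised. Any truncation of the display text would have to happen on the raw URL before escaping, so that an entity is never split across spans.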
I wonder what features we're relying on from Twitter::Autolink. Rails' auto_link seems to handle this correctly:
>> helper.auto_link "https://pgp.mit.edu/pks/lookup?op=get&search"
=> "<a href=\"https://pgp.mit.edu/pks/lookup?op=get&search\">https://pgp.mit.edu/pks/lookup?op=get&search</a>"
>> helper.auto_link '<a href="https://pgp.mit.edu/pks/lookup?op=get&search">foo</a>'
=> "<a href=\"https://pgp.mit.edu/pks/lookup?op=get&search\">foo</a>"
Now that https://github.com/tootsuite/mastodon/pull/2138 has been merged, this issue should be resolved, so I'm going to close it. @mart1oeil if you feel this hasn't been resolved to your satisfaction, please just let us know 👍