URLs automatically converted to links can include a final slash part of the URL.
If the URL contains a final slash, the link generated ignores it, breaking some links:
Content-Type: text/html;
charset=utf-8
Content-Transfer-Encoding: 8bit
<a href="http://google.com">http://google.com</a>/<br>
http://google.comK-9 Mail version: 5.008
Android version: 6.0
Account type (IMAP, POP3, WebDAV/Exchange): IMAP
A bug in the HtmlConverter. Confirmed and found exact behaviour using a unit test:
If a link is followed by a new line or a space the slash will be missed. Hence even though the second and third pass, the bug exists and is demonstrated by 4-6.
@Test
public void testLinks() {
String text = "http://www.google.com";
String result = HtmlConverter.textToHtml(text);
assertEquals("<pre class=\"k9mail\"><a href=\"http://www.google.com\">http://www.google.com</a></pre>", result);
text = "http://www.google.com/";
result = HtmlConverter.textToHtml(text);
assertEquals("<pre class=\"k9mail\"><a href=\"http://www.google.com/\">http://www.google.com/</a></pre>", result);
text = "http://google.com/";
result = HtmlConverter.textToHtml(text);
assertEquals("<pre class=\"k9mail\"><a href=\"http://google.com/\">http://google.com/</a></pre>", result);
text = "http://google.com/ ";
result = HtmlConverter.textToHtml(text);
assertEquals("<pre class=\"k9mail\"><a href=\"http://google.com/\">http://google.com/</a> </pre>", result);
text = "http://google.com/ \n";
result = HtmlConverter.textToHtml(text);
assertEquals("<pre class=\"k9mail\"><a href=\"http://google.com/\">http://google.com/</a> <br></pre>", result);
text = "http://google.com/\n";
result = HtmlConverter.textToHtml(text);
assertEquals("<pre class=\"k9mail\"><a href=\"http://google.com/\">http://google.com/</a><br></pre>", result);
}
Unfortunately we're using a nasty regex to extract URIs:
private static final Pattern WEB_URL_PATTERN = Pattern.compile(
"((?:(http|https|Http|Https|rtsp|Rtsp):\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)"
+ "\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_"
+ "\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@)?)?"
+ "((?:(?:[" + GOOD_IRI_CHAR + "][" + GOOD_IRI_CHAR + "\\-]{0,64}\\.)+" // named host
+ TOP_LEVEL_DOMAIN_STR_FOR_WEB_URL
+ "|(?:(?:25[0-5]|2[0-4]" // or ip address
+ "[0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\\.(?:25[0-5]|2[0-4][0-9]"
+ "|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1]"
+ "[0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}"
+ "|[1-9][0-9]|[0-9])))"
+ "(?:\\:\\d{1,5})?)" // plus option port number
+ "(\\/(?:(?:[" + GOOD_IRI_CHAR + "\\;\\/\\?\\:\\@\\&\\=\\#\\~" // plus option query params
+ "\\-\\.\\+\\!\\*\\'\\(\\)\\,\\_])|(?:\\%[a-fA-F0-9]{2}))*)?"
+ "(?:\\b|$)"); // and finally, a word boundary or end of
The approach to fixing this must involve lots of unit tests to cover as many valid cases as possible to avoid breaking something.
Incidentally it also looks like it might break on the new gTLDs as well. That should also be looked at.
(Probably also the iccTLDs too..)
Things it also breaks on:
Developer note: Instead of hard to maintain regular expressions I'd rather have proper parsers for URI detection. A framework that matches on the scheme identifier and then uses the appropriate parser to go from there could be relatively simple. That would allow to easily support a range of URIs (see e.g. PR #684).
OkHttp's HttpUrl tries hard to be compatible with current browsers and is probably a good place to borrow code for parsing http(s) URIs.
Would be https://developer.android.com/reference/android/util/Patterns.html#WEB_URL a possible solution or is that too inflexible?
I'm new to Open Source, and I'd like to work on this issue!
Is actively worked on this issue? Otherwise I would like to help out here.
@GoneUp Testing this does not seem to change anything.
So if I understood correctly a small universal construction is wanted with a corrected implementation for http?
Is this bug still open? I'd like to work on it. Are there other issues available which aren't marked as 'being worked upon'? So I may work on one of them to submit my first patch.
I'd like to work on this issue as well. A quick guide should suffice and I'll be on it. New to open source as well.
I'm new to open source, but I'd like to work on this issue, too, if it currently hasn't been patched yet.
There is already a promising PR for this, see #2392