K-9: Incorrect URL detection (final slash)

Created on 28 Mar 2016  路  10Comments  路  Source: k9mail/k-9

Expected behaviour

URLs automatically converted to links can include a final slash part of the URL.

Actual behaviour

If the URL contains a final slash, the link generated ignores it, breaking some links:

Content-Type: text/html;
 charset=utf-8
Content-Transfer-Encoding: 8bit

<a href="http://google.com">http://google.com</a>/<br>

Steps to reproduce

  1. Select HTML format in the settings
  2. Send a new mail with the following content: http://google.com

    Environment

K-9 Mail version: 5.008

Android version: 6.0

Account type (IMAP, POP3, WebDAV/Exchange): IMAP

bug good first issue help wanted

All 10 comments

A bug in the HtmlConverter. Confirmed and found exact behaviour using a unit test:

If a link is followed by a new line or a space the slash will be missed. Hence even though the second and third pass, the bug exists and is demonstrated by 4-6.

    @Test
    public void testLinks() {
        String text = "http://www.google.com";
        String result = HtmlConverter.textToHtml(text);
        assertEquals("<pre class=\"k9mail\"><a href=\"http://www.google.com\">http://www.google.com</a></pre>", result);

        text = "http://www.google.com/";
        result = HtmlConverter.textToHtml(text);
        assertEquals("<pre class=\"k9mail\"><a href=\"http://www.google.com/\">http://www.google.com/</a></pre>", result);

        text = "http://google.com/";
        result = HtmlConverter.textToHtml(text);
        assertEquals("<pre class=\"k9mail\"><a href=\"http://google.com/\">http://google.com/</a></pre>", result);

        text = "http://google.com/ ";
        result = HtmlConverter.textToHtml(text);
        assertEquals("<pre class=\"k9mail\"><a href=\"http://google.com/\">http://google.com/</a> </pre>", result);

        text = "http://google.com/ \n";
        result = HtmlConverter.textToHtml(text);
        assertEquals("<pre class=\"k9mail\"><a href=\"http://google.com/\">http://google.com/</a> <br></pre>", result);

        text = "http://google.com/\n";
        result = HtmlConverter.textToHtml(text);
        assertEquals("<pre class=\"k9mail\"><a href=\"http://google.com/\">http://google.com/</a><br></pre>", result);

    }

Unfortunately we're using a nasty regex to extract URIs:

    private static final Pattern WEB_URL_PATTERN = Pattern.compile(
                "((?:(http|https|Http|Https|rtsp|Rtsp):\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)"
                + "\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_"
                + "\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@)?)?"
                + "((?:(?:[" + GOOD_IRI_CHAR + "][" + GOOD_IRI_CHAR + "\\-]{0,64}\\.)+"   // named host
                + TOP_LEVEL_DOMAIN_STR_FOR_WEB_URL
                + "|(?:(?:25[0-5]|2[0-4]" // or ip address
                + "[0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\\.(?:25[0-5]|2[0-4][0-9]"
                + "|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1]"
                + "[0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}"
                + "|[1-9][0-9]|[0-9])))"
                + "(?:\\:\\d{1,5})?)" // plus option port number
                + "(\\/(?:(?:[" + GOOD_IRI_CHAR + "\\;\\/\\?\\:\\@\\&\\=\\#\\~"  // plus option query params
                + "\\-\\.\\+\\!\\*\\'\\(\\)\\,\\_])|(?:\\%[a-fA-F0-9]{2}))*)?"
                + "(?:\\b|$)"); // and finally, a word boundary or end of

The approach to fixing this must involve lots of unit tests to cover as many valid cases as possible to avoid breaking something.

Incidentally it also looks like it might break on the new gTLDs as well. That should also be looked at.

(Probably also the iccTLDs too..)

Things it also breaks on:

  • http://localhost
  • IPv6 addresses
  • Multiple query parameters (it incorrectly converts & to &)

Developer note: Instead of hard to maintain regular expressions I'd rather have proper parsers for URI detection. A framework that matches on the scheme identifier and then uses the appropriate parser to go from there could be relatively simple. That would allow to easily support a range of URIs (see e.g. PR #684).

OkHttp's HttpUrl tries hard to be compatible with current browsers and is probably a good place to borrow code for parsing http(s) URIs.

Would be https://developer.android.com/reference/android/util/Patterns.html#WEB_URL a possible solution or is that too inflexible?

I'm new to Open Source, and I'd like to work on this issue!

Is actively worked on this issue? Otherwise I would like to help out here.
@GoneUp Testing this does not seem to change anything.

So if I understood correctly a small universal construction is wanted with a corrected implementation for http?

Is this bug still open? I'd like to work on it. Are there other issues available which aren't marked as 'being worked upon'? So I may work on one of them to submit my first patch.

I'd like to work on this issue as well. A quick guide should suffice and I'll be on it. New to open source as well.

I'm new to open source, but I'd like to work on this issue, too, if it currently hasn't been patched yet.

There is already a promising PR for this, see #2392

Was this page helpful?
0 / 5 - 0 ratings

Related issues

Immortalin picture Immortalin  路  3Comments

ByteHamster picture ByteHamster  路  3Comments

asbach2 picture asbach2  路  3Comments

jrtberlin picture jrtberlin  路  3Comments

Agno94 picture Agno94  路  3Comments