(Split from #4420; #4593 seems to be the culprit.)
OS: Windows 7 Ultimate (64-bit), Service Pack 1
qTox version: v1.13.0
Commit hash: 531defd0aa66af1b128f7293bc08718ce2cc064f
toxcore: v0.1.10
Qt: v5.9.3
When a URL contains certain characters (like parentheses or Unicode), qTox now recognizes it only up to the offending character.
Try these links, for example:
https://en.wikipedia.org/wiki/Seal_(East_Asia)
https://ja.wikipedia.org/wiki/印章

Having the same problem with a pretty simple URL.
The issue here is pretty well explained by https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid
There is basically a very small set of all-ASCII characters that are "valid" in URLs as defined by the RFC (https://stackoverflow.com/a/1547940/3900868). This is what qTox is currently using to find URLs.
On the other hand, real-world links use many characters outside this set, i.e. any non-ASCII character, like https://ja.wikipedia.org/wiki/印章. Identifiers that allow such characters are called IRIs (RFC 3987), and even though they're not "officially valid" URLs they're still widely used. Parentheses, as in https://en.wikipedia.org/wiki/Seal_(East_Asia), are a separate case: RFC 3986 does allow them, but qTox's matching still stops at them.
For qTox to recognize real-world links, it should match IRIs instead of strict URLs, as GitHub and browsers do.
When we vastly expand our matching, we should be careful not to over-match. I've heard of attacks where a link looked exactly like website A but was interpreted by browsers as pointing to website B because it used look-alike characters.
> When we vastly expand our matching, we should be careful about not over-matching. I've heard of attacks before where links looked exactly like website A but were interpreted by browsers to go to website B because of their non-equal characters.
related blog-post: https://www.xudongz.com/blog/2017/idn-phishing/
The solution is to use punycode: https://en.wikipedia.org/wiki/Punycode.
Example in Python:
>>> u'https://www.аррӏе.com'.encode('idna') # fake apple.com domain
'https://www.xn--80ak6aa92e.com'
>>> u'https://www.apple.com'.encode('idna') # real apple.com domain
'https://www.apple.com'
Of course it will show the codes for all domains that have unicode characters in their names, so it's not perfect, but I don't think there is any other way. It's either do this or do nothing. Unless there is some better solution that I'm not aware of.
There is also a workaround used in Chrome (https://bugzilla.mozilla.org/show_bug.cgi?id=1332714#c17):
Chrome's fix is to collect all the Cyrillic letters in a label and then see if they are all in the set of 22 confusables. If they are and the TLD is ascii then they show punycode. If they find a Cyrillic letter outside that set then they let the normal IDN algorithm make the decision about allowed script mixing.
аррӏе.com would be punycode
аррӏе.ru would be punycode
аррӏе.рф would be IDN
Advantages: Protects users from registries not doing their job; protects against sub-domain label spoofing where the registry has no say in any case.
Disadvantages: will uglify at least 2800 .com domains. Do we know how many are legit vs spoofing demonstrations like аррӏе.com? More concerning are the unknown number in other ascii TLDs like .ru, .ua, etc. Given 22 letters to play with I would imagine a large number of legit Russian words fit in that set. It looks like some of those registries may only allow ascii domains on the ascii TLD and restrict the use of cyrillic to their cyrillic TLD (don't hold me to it--was skimming). On the other hand the .eu registry definitely accepts cyrillic (Bulgaria is a member) so that could be a problem.
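For reference, the Chrome heuristic quoted above could be sketched roughly like this in Python. This is only an illustration of the described rule, not Chrome's code: the function name `show_label_as_punycode` is made up, and `LATIN_LOOKALIKES` is a small illustrative subset, not the actual list of 22 confusable letters.

```python
import unicodedata

# ASSUMPTION: illustrative subset of Cyrillic letters that resemble Latin ones
# (a, c, e, i, j, o, p, s, x, y, l). This is NOT Chrome's actual set of 22.
LATIN_LOOKALIKES = set(
    "\u0430\u0441\u0435\u0456\u0458\u043e\u0440\u0455\u0445\u0443\u04cf"
)


def is_cyrillic(ch):
    """True if the character's Unicode name marks it as Cyrillic."""
    return "CYRILLIC" in unicodedata.name(ch, "")


def show_label_as_punycode(label, tld):
    """Return True if the label should be displayed as punycode.

    Rule as described in the quote: if every Cyrillic letter in the label
    is a Latin look-alike and the TLD is ASCII, show punycode. Otherwise
    return False, meaning "defer to the normal IDN display policy".
    """
    cyrillic = [ch for ch in label if is_cyrillic(ch)]
    if not cyrillic:
        return False
    return tld.isascii() and all(ch in LATIN_LOOKALIKES for ch in cyrillic)
```

With this sketch, the fake apple label is flagged under .com and .ru but not under .рф, which matches the three cases listed in the quote.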
> Of course it will show the codes for all domains that have unicode characters in their names, so it's not perfect, but I don't think there is any other way. It's either do this or do nothing. Unless there is some better solution that I'm not aware of.
How about show both (punycode in parenthesis) or show the punycode on mouseover?
That sounds like a good idea, but I'm worried that displaying punycode anywhere in the UI will always be confusing to most users. This is the same problem that web browsers face: do we make a lot of URLs ugly for a lot of users or leave everyone unprotected from this? Such a setting could be added, but it should be optional and disabled by default.
I don't think we should try to offer any protection against this by default. It looks like we can't do anything to protect users from this without harming them at the same time. Even web browsers can't do much. Firefox doesn't do this by default, but offers a setting to turn it on. Their opinion is that it should be the job of domain registrars to prevent people from registering those kinds of domains. In Internet Explorer it depends on the user's language settings (https://msdn.microsoft.com/en-us/library/dd565654(v=vs.85).aspx#example). In Chrome it looks like they convert only some domain names to punycode (but still many).
Whatever we decide, the detection of links should be fixed anyway. Users don't gain anything from URLs being broken.
I agree mostly with @tox-user. If Chrome and Firefox don't have good solutions to this, we're not going to homebrew some regex to do better. Checking characters that look kind of similar between different languages in different fonts sounds like something we could maybe do to some degree if we pulled in an external lib that Firefox or Chrome use.
RE: IDNA encoding example:
>>> u'https://ja.wikipedia.org/wiki/印章'.encode('idna')
'https://ja.wikipedia.xn--org/wiki/-tl7oq485a'
The original link is a valid page, but after IDNA encoding it isn't, because the whole URL gets encoded instead of just the host name. So we might have to do something a little different to display the ASCII-only option.
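One way to do "something a little different" might be to IDNA-encode only the host name and percent-encode the rest. A minimal Python 3 sketch, assuming the standard library's `urllib.parse` and the built-in `idna` codec are acceptable stand-ins for whatever qTox would use (the helper name `to_ascii_url` is hypothetical, and userinfo in the URL is dropped for simplicity):

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Return an ASCII-only form of url.

    The host is converted with the built-in 'idna' codec (punycode);
    path, query and fragment are percent-encoded as UTF-8. Existing
    percent escapes and parentheses are left untouched.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    ascii_host = host.encode("idna").decode("ascii")
    if parts.port is not None:
        ascii_host += ":%d" % parts.port
    return urlunsplit((
        parts.scheme,
        ascii_host,
        quote(parts.path, safe="/()%"),
        quote(parts.query, safe="=&%"),
        quote(parts.fragment, safe="%"),
    ))


print(to_ascii_url("https://ja.wikipedia.org/wiki/印章"))
```

Here the host stays ja.wikipedia.org and only the path changes, so the resulting link still points at the same page.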
I think having a setting under security to display all links in ASCII might be reasonable, but showing punycode to users by default would cause more confusion and worry than anything else. If we want to be easy to use and replace Skype, showing punycode URLs for any non-ASCII alphabet domain doesn't seem like the right approach.
> the original link is a valid page, but after IDNA encoding it isn't. So we might have to do something a little different to display the ascii-only option
Perhaps show the link as ASCII, but when it's clicked open the unicode version?
Sure, but still, a random non-technical Japanese user who sees https://ja.wikipedia.xn--org/wiki/-tl7oq485a instead of https://ja.wikipedia.org/wiki/印章 is likely going to think the URL is fake or malicious. Only highly-technical users will see it as being a more safe version of the link. This is why I think it should be hidden behind an option.
I agree. Honestly we don't even have to add it as an option. It's the internet that's broken :). Still it would be good to have it.
So should the goal be to match [scheme]://[anything].[any known TLD]/[more anything/nothing], make that a link, and let browsers deal with the rest? It still wouldn't match links like google.com but would match anything with http/https/ftp followed by anything in any language that looks a bit like a domain.
i.e. https://some.stupid.domain.ninja/😸😹😺 would match. GitHub agrees :)
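A rough Python sketch of that matching rule, just to make the idea concrete. It is an assumption-laden illustration, not qTox's implementation: the pattern `URL_RE` and the helper `find_links` are made up, the "known TLD" check is left out, and the trailing-punctuation trimming is an extra heuristic not discussed above.

```python
import re

# ASSUMPTION: scheme, "://", then any run of characters that are not
# whitespace, angle brackets or double quotes. No TLD list is checked.
URL_RE = re.compile(r'\b(?:https?|ftp)://[^\s<>"]+', re.IGNORECASE)


def find_links(text):
    """Return the URLs found in text, including non-ASCII ones."""
    links = []
    for match in URL_RE.finditer(text):
        url = match.group(0)
        while url:
            if url[-1] in ".,;:!?":
                # Trailing punctuation most likely belongs to the sentence.
                url = url[:-1]
            elif url[-1] == ")" and url.count(")") > url.count("("):
                # Drop a closing parenthesis that has no opening partner.
                url = url[:-1]
            else:
                break
        links.append(url)
    return links


print(find_links("See https://en.wikipedia.org/wiki/Seal_(East_Asia), thanks."))
```

Balanced parentheses inside the link are kept, while a parenthesis that merely wraps the link in running text is dropped.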
Sounds like a good idea!
> Sure, but still, a random non-technical Japanese user who sees https://ja.wikipedia.xn--org/wiki/-tl7oq485a instead of https://ja.wikipedia.org/wiki/印章 is likely going to think the URL is fake or malicious. Only highly-technical users will see it as being a more safe version of the link.
That's why I suggest showing https://ja.wikipedia.org/wiki/印章 but displaying https://ja.wikipedia.xn--org/wiki/-tl7oq485a on mouseover.
> Honestly we don't even have to add it as an option. It's the internet that's broken :).
But by additionally displaying punycode (for non-ASCII URLs only), qTox could help to raise awareness of this kind of attack, making the internet a bit safer. :wink:
> If we want to be easy to use and replace Skype, showing punycode URLs for any non-ASCII alphabet domain doesn't seem like the right approach.
I think if the original URL is still clearly visible, offering an additional layer of protection from fake links would make qTox an even better alternative to Skype. :wink:
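The mouseover idea could look something like the following Python sketch. It assumes the chat view renders HTML links and shows the `title` attribute as a tooltip, which may or may not hold for qTox's widget; the helper name `link_html` is hypothetical.

```python
from html import escape
from urllib.parse import urlsplit


def link_html(url):
    """Render url as an HTML link that keeps the original text visible.

    If the host contains non-ASCII characters, its punycode form is added
    as a title attribute so it can be shown on mouseover.
    """
    host = urlsplit(url).hostname or ""
    ascii_host = host.encode("idna").decode("ascii")
    attrs = 'href="%s"' % escape(url, quote=True)
    if ascii_host != host:
        attrs += ' title="%s"' % escape(ascii_host, quote=True)
    return "<a %s>%s</a>" % (attrs, escape(url))


print(link_html("https://www.apple.com"))
```

Plain ASCII links come out unchanged apart from the markup, so only links with a non-ASCII host would get the extra tooltip.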