Got: URI malformed when trying to get URL with URL-encoded chars

Created on 17 Nov 2017 · 5 Comments · Source: sindresorhus/got

Hello,

I get a "URI malformed" error in the new "got" when I try to send a request to a URL with URL-encoded chars.
URL examples:
https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/
https://www.kinopoisk.ru/news/keyword/%C7%E2%E5%E7%E4%ED%FB%E5+%E2%EE%E9%ED%FB/

nodejs: 9.2.0
got: 8.0.0

Failed code:

const got = require('got');

(async () => {
    try {
        const response = await got('https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/');
        console.log(response);
    } catch (error) {
        console.log(error);
    }
})();

Error output:

URIError: URI malformed
    at decodeURI (<anonymous>)
    at module.exports (/Users/kirill-m/git/test/node_modules/normalize-url/index.js:87:21)
    at /Users/kirill-m/git/test/node_modules/cacheable-request/src/index.js:43:16
    at get (/Users/kirill-m/git/test/node_modules/got/index.js:98:20)
    at Promise.resolve.then.size (/Users/kirill-m/git/test/node_modules/got/index.js:274:5)
    at <anonymous>

Broken at this commit: https://github.com/sindresorhus/got/commit/3c7920507fae88a5f53d0640b5116fa34a5ed829

Any ideas?

enhancement ✭ help wanted ✭

All 5 comments

Explanation of failure:

This website uses a legacy single-byte encoding (it appears to be windows-1251, the Cyrillic code page, rather than windows-1252). That is unsupported by the JS-native decode utilities, which assume UTF-8 input. The encoded portions of the provided URLs contain byte sequences that are invalid UTF-8, so they cannot be decoded. The error can be reproduced in any browser console or REPL that implements the spec (e.g. repl.it).

Take %D2%E0%EB%EB%E8%ED as an example. It represents Таллин in windows-1251 (the same bytes read as Òàëëèí in windows-1252). The equivalent in UTF-8 would be %D0%A2%D0%B0%D0%BB%D0%BB%D0%B8%D0%BD, giving a URL of https://www.kinopoisk.ru/community/city/%D0%A2%D0%B0%D0%BB%D0%BB%D0%B8%D0%BD/. Unfortunately, this URL won't work due to the encoding of the website.
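To make the failure concrete, here is a minimal sketch that inspects the raw bytes of the example segment and shows where UTF-8 decoding breaks. The helper names (`isTwoByteLead`, `isContinuation`, `throwsURIError`) are made up for illustration and are not part of Got or normalize-url.

```javascript
// Percent-encoded path segment from the first example URL in this issue.
const encoded = '%D2%E0%EB%EB%E8%ED';

// Turn "%D2%E0..." into raw byte values: [0xD2, 0xE0, ...].
const bytes = encoded
    .split('%')
    .filter(Boolean)
    .map(hex => parseInt(hex, 16));

// In UTF-8, a byte shaped 110xxxxx starts a two-byte sequence,
// and the byte after it must be shaped 10xxxxxx (a continuation byte).
const isTwoByteLead = byte => (byte & 0xE0) === 0xC0;
const isContinuation = byte => (byte & 0xC0) === 0x80;

console.log(isTwoByteLead(bytes[0]));  // 0xD2 looks like a two-byte lead...
console.log(isContinuation(bytes[1])); // ...but 0xE0 is not a continuation byte.

// Hypothetical helper: reports whether decodeURIComponent rejects the input.
function throwsURIError(value) {
    try {
        decodeURIComponent(value);
        return false;
    } catch (error) {
        if (error instanceof URIError) {
            return true;
        }
        throw error;
    }
}

console.log(throwsURIError(encoded));  // the legacy-encoded segment
console.log(throwsURIError('%C3%A9')); // a valid UTF-8 sequence ("é")
```

The first byte promises a two-byte UTF-8 sequence and the second byte breaks that promise, which is why the native decoder gives up with a URIError instead of guessing.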


Why this fails now, but not before:

Caching was added via lukechilds/cacheable-request, which pulled in the package sindresorhus/normalize-url, which uses decodeURI internally. This module _could_ perform a best-effort decoding - falling back to the encoded value - when the string is not UTF-8 encoded. This would allow URLs that happen to be encoded unexpectedly to process successfully.
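A best-effort decode of that kind could look roughly like the sketch below. The function name `decodeURIBestEffort` is hypothetical, and this is not what normalize-url actually does; it only illustrates the fallback idea.

```javascript
// Hypothetical helper: try to decode, and fall back to the original
// (still percent-encoded) string when the input is not valid UTF-8.
function decodeURIBestEffort(value) {
    try {
        return decodeURI(value);
    } catch (error) {
        if (error instanceof URIError) {
            return value; // leave the URL untouched rather than fail
        }
        throw error; // anything else is unexpected, so surface it
    }
}

const legacyUrl = 'https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/';
const utf8Url = 'https://example.com/caf%C3%A9';

console.log(decodeURIBestEffort(legacyUrl)); // unchanged, still encoded
console.log(decodeURIBestEffort(utf8Url));   // decoded to a readable form
```

The trade-off is that non-UTF-8 URLs skip that part of normalization, so two differently encoded spellings of the same resource would not collapse to one cache key.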

I don't think a fix, if any, would be applied here directly in Got.

According to RFC 3986, percent-encoded text in URLs should be UTF-8. The widely used Express also throws on URLs that are not UTF-8 encoded. By throwing, Got now enforces spec-compliant URLs before making a request, which isn't necessarily a bad thing.

Thanks for elaborating @brandon93s. I think the correct fix here is to detect the case early in Got and throw a user-friendly error about the URL having an invalid encoding.

@sindresorhus Are we okay with a brute force try...catch around a decodeURI early-ish on in normalizeArguments to catch any potential errors with a user-friendly message? Any decodeURI failure will present an issue, so we might as well check upfront and inform the consumer!
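As a rough sketch of that idea (not the code that ended up in Got), an upfront check might look like this. `assertDecodableUrl` and the error wording are assumptions made up for illustration.

```javascript
// Hypothetical upfront check: verify the URL can be decoded as UTF-8
// and replace the terse "URI malformed" with a more descriptive message.
function assertDecodableUrl(url) {
    try {
        decodeURI(url);
    } catch (error) {
        if (error instanceof URIError) {
            throw new URIError(
                `URL contains percent-encoded bytes that are not valid UTF-8: ${url}`
            );
        }
        throw error;
    }
    return url;
}

// Usage: a spec-compliant URL passes through, the legacy-encoded one fails early.
console.log(assertDecodableUrl('https://example.com/caf%C3%A9'));

try {
    assertDecodableUrl('https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/');
} catch (error) {
    console.log(error.message);
}
```

Checking once at the entry point means the consumer sees the offending URL in the message, instead of a stack trace pointing into a nested dependency.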

Glad to implement...

Yes
