Node: url.parse() port instead of hostname

Created on 16 Jun 2017  ·  18 Comments  ·  Source: nodejs/node

  • Version: 6.9.2
  • Platform: macOS Sierra
  • Subsystem: url module
url.parse('localhost:8081').hostname

outputs:

"8081"
question url

All 18 comments

Hi @onury,
I don't think this is really a bug, because you have to include the protocol in the input to the parse function:

url.parse('http://localhost:8081').hostname

The url module follows RFC 3986, RFC 1808 and RFC 2396, and in the parse function the first element is treated as the protocol part of the URL; see: https://github.com/nodejs/node/blob/master/lib/url.js#L206
So maybe what we need is a check of whether the URL is fit to be parsed or not.

@NickNaso I took a look at the implementation of the function, and url.parse treats whatever comes before the first ':' as the protocol:

url.parse('a:b').protocol

returns
'a:'

So it's not checking for a pattern like 'http://' or the like, as it expects a protocol to always be present.
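The splitting described above can be sketched with a regular expression built from the RFC 3986 scheme grammar. This is only an illustration of "split at the first ':'", not the actual code in lib/url.js; `splitScheme` and `SCHEME_RE` are hypothetical names.

```javascript
// Hypothetical helper: split off whatever looks like a scheme, the way a
// parser that only looks for ':' (and not '://') would.
// Scheme grammar per RFC 3986: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
const SCHEME_RE = /^([A-Za-z][A-Za-z0-9+.-]*):/;

function splitScheme(input) {
  const match = SCHEME_RE.exec(input);
  if (match === null) {
    // No "scheme:" prefix at all, so the whole input is the remainder.
    return { scheme: null, rest: input };
  }
  return {
    scheme: match[1].toLowerCase(),
    rest: input.slice(match[0].length),
  };
}

console.log(splitScheme('a:b'));              // scheme 'a', rest 'b'
console.log(splitScheme('localhost:8081'));   // scheme 'localhost', rest '8081'
console.log(splitScheme('cannot.trust/url')); // scheme null
```

Under this rule "localhost" is a perfectly good scheme name, which is why the port ends up in the remainder.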

@aaneto The protocol doesn't have to be http, and in fact even malformed URLs without the double slash like http:localhost:8081 are allowed by browsers, etc.

Users are generally encouraged to use the newer and more standard-compliant WHATWG URL parser that's in Node.js v7.0.0+, which will parse the string in the following fashion:

> new url.URL('localhost:8081')
URL {
  href: 'localhost:8081',
  origin: 'null',
  protocol: 'localhost:',
  username: '',
  password: '',
  host: '',
  hostname: '',
  port: '',
  pathname: '8081',
  search: '',
  searchParams: URLSearchParams {},
  hash: '' }

Closing, as the logic behind url.parse's choice has been explained and a better alternative is posed.

Then you cannot trust url.parse() unless it's 100% strict URL.
You should always validate the URL passed to url.parse() before parsing. And fill in the missing protocol..

  • RFC standards may require the protocol portion but conventions and browsers do not.
    These are all considered valid: localhost:8081, www.foo.com, domain.com/path, etc..
  • url.parse() shouldn't check for : but :// for protocol portion.
  • also a URL starting with // should return null for protocol.
  • it's better that it doesn't force existing valid protocols such as http, ftp..
    It should allow any pseudo protocol. i.e. this is valid: twitter://
url.parse('cannot.trust/url')
{"protocol":"cannot.trust:","slashes":null,"auth":null,"host":"80","port":null,"hostname":"80","hash":null,"search":null,"query":null,"pathname":"/url/parse","path":"/url/parse","href":"cannot.trust:80/url/parse"}

host and hostname are both parsed as 80 👍
protocol is considered cannot.trust but href is cannot.trust:80/url/parse so it puts port after protocol... but then again port is null.

This forces me to write my own parser ..
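The "validate first and fill in the missing protocol" idea can be done on the caller's side without a custom parser. A minimal sketch, assuming that `http` is the right default and that only inputs containing `scheme://` count as having a protocol; `parseWithDefaultScheme` is a hypothetical helper, not part of the url module.

```javascript
// Hypothetical helper: if the input has no "scheme://" prefix, assume a
// default scheme before handing the string to the WHATWG URL parser.
// The 'http' default is an assumption of this sketch, not Node.js behaviour.
function parseWithDefaultScheme(input, defaultScheme = 'http') {
  const hasSchemeAndSlashes = /^[A-Za-z][A-Za-z0-9+.-]*:\/\//.test(input);
  return new URL(hasSchemeAndSlashes ? input : `${defaultScheme}://${input}`);
}

const hostPort = parseWithDefaultScheme('localhost:8081');
console.log(hostPort.hostname, hostPort.port); // localhost 8081

const hostPath = parseWithDefaultScheme('domain.com/path');
console.log(hostPath.hostname, hostPath.pathname); // domain.com /path

const custom = parseWithDefaultScheme('twitter://timeline');
console.log(custom.protocol); // twitter:
```

The trade-off is that links which legitimately have no `//` after the scheme (mailto:, magnet: and similar) would be misread by this heuristic, so it only suits inputs known to be host-based.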

You should always validate the URL passed to url.parse() before parsing. And fill in the missing protocol..
RFC standards may require the protocol portion but conventions and browsers do not.
These are all considered valid: localhost:8081, www.foo.com, domain.com/path, etc..

Browsers are different, in that they have a "base URL" (specifically, the URL of the current document) to fall back on. E.g. when a browser sees domain.com/path in the href attribute of an a element, given the current document uses https scheme, it automatically uses the base URL's scheme https; and if it is http, it uses http instead.

Node.js is different. There isn't a document.baseURI for Node.js, and as such we force the user to supply their own base through either url.resolve or new URL(input, base).

(And also, try using <a href="localhost:8080">link</a> in a browser. It will not go to http://localhost:8080/, but instead localhost:8080 where localhost is the scheme/protocol.)
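A short sketch of the base-URL point, using new URL(input, base). The base below is a made-up value standing in for a document's base URL.

```javascript
// `URL` is a global in current Node.js releases (also exported by 'node:url').
// Made-up base, playing the role that document.baseURI plays in a browser.
const base = 'https://example.org/docs/index.html';

// No scheme in the input, so it resolves as a relative path against the base
// and inherits the base's scheme.
const relative = new URL('domain.com/path', base);
console.log(relative.href); // https://example.org/docs/domain.com/path

// Everything before the first ':' is a syntactically valid scheme here,
// so the base is ignored and 'localhost:' becomes the protocol.
const schemeLike = new URL('localhost:8080', base);
console.log(schemeLike.protocol); // localhost:
console.log(schemeLike.pathname); // 8080
```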

url.parse() shouldn't check for : but :// for protocol portion.

This would make it impossible to parse, among others, magnet links.
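For illustration, here is how URLs without `//` come out of the WHATWG parser. The magnet link and mail address are made-up example values.

```javascript
// Neither URL contains '://', yet both have a well-defined scheme.
const magnet = new URL('magnet:?xt=urn:btih:abc123&dn=example');
console.log(magnet.protocol);               // magnet:
console.log(magnet.searchParams.get('xt')); // urn:btih:abc123

const mail = new URL('mailto:user@example.com');
console.log(mail.protocol); // mailto:
console.log(mail.pathname); // user@example.com
```

A parser that only recognised a protocol when followed by `://` would have to treat both of these as scheme-less strings.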

also a URL starting with // should return null for protocol.

We already do:

> url.parse('//foo/bar')
Url {
  protocol: null,
  slashes: null,
  auth: null,
  host: null,
  port: null,
  hostname: null,
  hash: null,
  search: null,
  query: null,
  pathname: '//foo/bar',
  path: '//foo/bar',
  href: '//foo/bar' }

it's better that it doesn't force existing valid protocols such as http, ftp..
It should allow any pseudo protocol. i.e. this is valid: twitter://

We already support non-standard protocol in both parsers.

> url.parse('twitter://')
Url {
  protocol: 'twitter:',
  slashes: true,
  auth: null,
  host: '',
  port: null,
  hostname: '',
  hash: null,
  search: null,
  query: null,
  pathname: null,
  path: null,
  href: 'twitter://' }
> new url.URL('twitter://')
URL {
  href: 'twitter://',
  origin: 'null',
  protocol: 'twitter:',
  username: '',
  password: '',
  host: '',
  hostname: '',
  port: '',
  pathname: '',
  search: '',
  searchParams: URLSearchParams {},
  hash: '' }

(also reopening...)

Thanks. I was suggesting that it's good that it allows pseudo protocols.

For URLs like localhost:8081, by convention I mean what happens when you enter this in the address bar of a browser: it goes to http://localhost:8081, so it doesn't consider localhost the protocol. I'm not suggesting that url.parse() should do the same. It shouldn't set http as the default, but rather parse the protocol as null.

@onury Which version are you using? On v8.1.2 it does parse the protocol as null:

> url.parse('cannot.trust/url')
Url {
  protocol: null,
  slashes: null,
  auth: null,
  host: null,
  port: null,
  hostname: null,
  hash: null,
  search: null,
  query: null,
  pathname: 'cannot.trust/url',
  path: 'cannot.trust/url',
  href: 'cannot.trust/url' }

EDIT: so does v6.11.0 and v4.8.3

Hi guys,
I tested url.parse('localhost:8081') on Node.js versions 7.10.0 and 8.1.2 and the result is always the same:

> var p = url.parse('localhost:8081')
undefined
> p
Url {
  protocol: 'localhost:',
  slashes: null,
  auth: null,
  host: '8081',
  port: null,
  hostname: '8081',
  hash: null,
  search: null,
  query: null,
  pathname: null,
  path: null,
  href: 'localhost:8081' }
>

I also tried this specific case in the browser, with the same result in both Firefox and Chrome.
In my opinion, the parse method should only be responsible for parsing the URL, not for validating it.

@joyeecheung v6.9.2

@NickNaso I agree, it shouldn't validate. but parse..

@NickNaso I agree, it shouldn't validate. but parse..

@onury Do you suggest it should be documented as a Note on "known behaviour"?

@refack Documenting this would be good: it would help avoid mistakes and make clear what parse really does.

@refack, @NickNaso,
At least...
But I'd insist this is not the correct behaviour.

It'd be more logical to determine if the string includes a protocol or not by simply checking for :// not :.

One argument above for this is that; URLs without the double slash like http:localhost:8081 are allowed by browsers. Well, localhost:8081 is also allowed and treated as http://localhost:8081 by browsers.

url.parse('http:localhost:8081') parses http: as protocol but hostname as null..

{"protocol":"http:","slashes":null,"auth":null,"host":null,"port":null,"hostname":null,"hash":null,"search":null,"query":null,"pathname":"localhost:8081","path":"localhost:8081","href":"http:localhost:8081"}

Conforming to RFCs but does it really? I believe protocol:hostname is not a valid format..
Which spec says path/name can come right after protocol?

url.parse() should know better.

@onury @NickNaso We have implemented WHATWG URL (URL in browsers) in Node.js, and url.parse() is more like a legacy API better left alone for compatibility reasons. If you do want the same kind of behavior as in the browsers, maybe just use new url.URL() instead? (although in this case I would consider the new URL() behavior rather than the type-in-the-address-bar behavior as "how browsers treat this URL").

Also the weird thing is...on my Mac:

> process.version
'v6.9.2'
> url.parse('cannot.trust/url')
Url {
  protocol: null,
  slashes: null,
  auth: null,
  host: null,
  port: null,
  hostname: null,
  hash: null,
  search: null,
  query: null,
  pathname: 'cannot.trust/url',
  path: 'cannot.trust/url',
  href: 'cannot.trust/url' }

@onury

Well, localhost:8081 is also allowed and treated as http://localhost:8081 by browsers.

Only in address bars, which have a human audience, not a machine audience. Any pathway for URL parsing that has a machine audience -- including new URL and <a> -- parses localhost: as the protocol part.

I believe protocol:hostname is not a valid format..

Agreed, but protocol:pathname is valid. Which is exactly how url.parse interprets http:localhost:8081.

Which spec says path/name can come right after protocol?

https://tools.ietf.org/html/rfc3986#section-3

The hier-part production has four branches, three of which do not have //.
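As a rough illustration, one input per branch can be run through the WHATWG URL class, which handles these four shapes in the same way. `foo` is a made-up, non-special scheme used purely for the example.

```javascript
// One sample per hier-part branch of RFC 3986 section 3.
const samples = [
  'foo://host/path', // "//" authority path-abempty
  'foo:/abs/path',   // path-absolute
  'foo:rootless',    // path-rootless
  'foo:',            // path-empty
];

const results = samples.map((input) => {
  const u = new URL(input);
  return { input, host: u.host, pathname: u.pathname };
});

for (const r of results) {
  console.log(r.input, '->', JSON.stringify({ host: r.host, pathname: r.pathname }));
}
// Only the first sample has a non-empty host; the other three go straight
// from the scheme to a path (or to nothing at all).
```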

url.parse() is behaving correctly in this case. There is no bug here. localhost:8181 is a syntactically valid URL string with localhost: as the protocol and 8181 as the path. If the intention is to parse it as an http:// URL, then the http:// protocol needs to be specified explicitly: url.parse('http://localhost:8181'). Otherwise, the url.parse() implementation has no way of knowing what the user's intent is. Browsers can safely assume that the user meant http:// because of the nature of what a browser does. The more generic url.parse() cannot make such assumptions, nor should it.

http:localhost:8181 is not a valid absolute URL string. It will be parsed as a relative URL, meaning that the localhost:8181 is assumed to be part of the path segment. However, because users have commonly made this mistake, the WHATWG URL standard has accounted for it and opted to be lenient.

The following each represent correct behavior per the relevant standards, even though they differ in result. The difference is that new url.URL() conforms to the WHATWG standard, while url.parse() conforms to the IETF RFC.

> new url.URL('http:localhost:8181')
URL {
  href: 'http://localhost:8181/',
  origin: 'http://localhost:8181',
  protocol: 'http:',
  username: '',
  password: '',
  host: 'localhost:8181',
  hostname: 'localhost',
  port: '8181',
  pathname: '/',
  search: '',
  searchParams: URLSearchParams {},
  hash: '' }
> url.parse('http:localhost:8181')
Url {
  protocol: 'http:',
  slashes: null,
  auth: null,
  host: null,
  port: null,
  hostname: null,
  hash: null,
  search: null,
  query: null,
  pathname: 'localhost:8181',
  path: 'localhost:8181',
  href: 'http:localhost:8181' }
>

The following is also correct per each of the relevant standards:

> new url.URL('localhost:8181')
URL {
  href: 'localhost:8181',
  origin: 'null',
  protocol: 'localhost:',
  username: '',
  password: '',
  host: '',
  hostname: '',
  port: '',
  pathname: '8181',
  search: '',
  searchParams: URLSearchParams {},
  hash: '' }
> url.parse('localhost:8181')
Url {
  protocol: 'localhost:',
  slashes: null,
  auth: null,
  host: '8181',
  port: null,
  hostname: '8181',
  hash: null,
  search: null,
  query: null,
  pathname: null,
  path: null,
  href: 'localhost:8181' }
>

Even in the browser, when using the URL API, you will get exactly the same results as when using the WHATWG URL API in Node.js.

There is no bug here. I recommend closing.

@TimothyGu, @jasnell, @joyeecheung
Fair enough.. Thanks.

Very frustrating and confusing behavior. You can't even parse an elementary host:port URL. It seems the simplest replacement for this purpose is new URL('nodejs.org:1234', 'foo://'); it will work with protocol-including URLs as well: new URL('bar://nodejs.org:1234', 'foo://');

URL {
  href: 'bar://nodejs.org:1234',
  origin: 'null',
  protocol: 'bar:',
  username: '',
  password: '',
  host: 'nodejs.org:1234',
  hostname: 'nodejs.org',
  port: '1234',
  pathname: '',
  search: '',
  searchParams: URLSearchParams {},
  hash: '' }
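
The bare host:port form is worth checking against your own Node.js version. Under the WHATWG parsing rules, a string like nodejs.org:1234 begins with something that is itself a valid scheme, so the base may not be consulted at all. A minimal sketch comparing three inputs against the same placeholder base; prefixing `//` is one possible workaround, offered here as an assumption rather than an established recipe.

```javascript
// 'foo://' is the placeholder base used in the comment above.
const withScheme = new URL('bar://nodejs.org:1234', 'foo://');
console.log(withScheme.hostname, withScheme.port); // nodejs.org 1234

// Bare host:port: 'nodejs.org' satisfies the scheme grammar, so the
// parser reads it as the protocol and '1234' as an opaque path.
const bare = new URL('nodejs.org:1234', 'foo://');
console.log(bare.protocol, bare.pathname); // nodejs.org: 1234

// A leading '//' marks the input as scheme-relative, so the scheme comes
// from the base and the rest is parsed as host and port.
const prefixed = new URL('//nodejs.org:1234', 'foo://');
console.log(prefixed.protocol, prefixed.hostname, prefixed.port); // foo: nodejs.org 1234
```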