Nixpkgs: ZooKeeper clustering broken on 19.03 due to /etc/hosts change

Created on 13 Mar 2019 · 12 Comments · Source: NixOS/nixpkgs

Issue description

A working ZooKeeper cluster can't be made on 19.03, and I think this is the culprit: https://github.com/NixOS/nixpkgs/pull/47241

The problem is that the ZK cluster is defined like this:

server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888

I'm not 100% sure how this works, and it's seriously badly documented (that is, not at all) by the ZK people, but the following seems to fit what I experience:

ZooKeeper knows which id it has, so it uses the hostname from the corresponding line in the configuration to figure out the external address to bind to. Except that since https://github.com/NixOS/nixpkgs/pull/47241 this will now resolve to 127.0.1.1, making it bind the leader election port (3888) to that address, which makes it impossible for other nodes to connect to it:

$ netstat -nlp
[snip]
tcp        0      0 0.0.0.0:2181            0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.1.1:3888          0.0.0.0:*               LISTEN      - 
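
The suspected mechanism can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ZooKeeper's actual code: the helper names `resolve_from_hosts` and `election_bind_address` are hypothetical, and the hosts text imitates what 19.03 appears to generate on a node named zk1 (the 127.0.0.1 line is assumed).

```python
def resolve_from_hosts(hosts_text, name):
    """Return the first address whose hosts-file line lists `name`, else None."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2 and name in fields[1:]:
            return fields[0]
    return None


def election_bind_address(server_lines, my_id, hosts_text):
    """Find the server.<my_id> line and resolve its host with a hosts-first lookup."""
    for line in server_lines:
        key, value = line.split("=", 1)
        if key == f"server.{my_id}":
            host, _peer_port, election_port = value.split(":")
            return resolve_from_hosts(hosts_text, host), int(election_port)
    raise KeyError(f"no server.{my_id} entry")


# Assumed /etc/hosts content on the node named zk1 (illustrative only).
HOSTS_19_03 = """\
127.0.0.1 localhost
127.0.1.1 zk1
"""

SERVERS = [
    "server.1=zk1:2888:3888",
    "server.2=zk2:2888:3888",
    "server.3=zk3:2888:3888",
]

bind = election_bind_address(SERVERS, 1, HOSTS_19_03)
print(bind)
```

If the node's own name resolves through the hosts file first, the election port ends up on a loopback address, which would match the netstat output above.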

The ZK documentation explicitly says that it will bind to all interfaces for the client port (2181), which is what it's doing, but it is vague about what happens with the leader election port (3888). Binding that port to a specific address does make sense, since cluster communication can then be locked down to a separate network if need be.

Which other distributions point the hostname to 127.0.1.1? It seems quite unfortunate. It even makes it impossible to test, e.g., whether a service is reachable from the outside by connecting to it through its real DNS name (as in curl myhost vs curl localhost).

ping @oxij (sorry, you made the change, so hoping you can give some insight into this)

Steps to reproduce

Create a ZooKeeper cluster, and specify the cluster topology using hostnames instead of IP-addresses.

Technical details

Please run nix-shell -p nix-info --run "nix-info -m" and paste the
results.

Not practical (locked down network), but we've currently pinned to

{
  "url": "https://github.com/NixOS/nixpkgs-channels/archive/50876481a0127ad885fcbfd48ab24bbacbc26395.tar.gz",
  "rev": "50876481a0127ad885fcbfd48ab24bbacbc26395",
  "date": "2019-03-10T23:40:18+01:00",
  "sha256": "063q2jhi9lf6azbhlrn3cygpaa3n65n3d8g7c1s0vvsj8rxv8b80"
}

All 12 comments

On 18.09 the netstat output is like this instead:
$ netstat -nlp
[snip]
tcp        0      0 172.20.40.43:3888       0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:2181            0.0.0.0:*               LISTEN

Which works fine.

According to the Debian reference manual (https://www.debian.org/doc/manuals/debian-reference/ch05.en.html#_the_legacy_network_connection_and_configuration):

For a system with a permanent IP address, that permanent IP address should be used here instead of 127.0.1.1.

The implementation in modules/config/networking.nix fails to take this into account as well.

For a system with a permanent IP address, that permanent IP address should be used here instead of 127.0.1.1.

I guess, the simplest solution is to add something like networking.permanentIPv4 defaulting to 127.0.1.1 and use that in /etc/hosts generator.

A more complicated solution is to alias all IP addresses from all the statically configured interfaces to networking.hostName in /etc/hosts generator but I suspect that it would have unexpected side effects (and won't really help in your use case).
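
The first proposal above can be modelled with a short Python sketch. It is not the Nix module itself: `networking.permanentIPv4` is only a suggestion in this thread, `render_hosts` is a hypothetical name, and the 127.0.0.1 line is an assumption about what the generator emits.

```python
def render_hosts(hostname, permanent_ipv4="127.0.1.1"):
    """Model a hosts generator whose hostname alias address is configurable."""
    lines = ["127.0.0.1 localhost"]
    if hostname:  # no alias line at all when the hostname is unset
        lines.append(f"{permanent_ipv4} {hostname}")
    return "\n".join(lines) + "\n"


# Default behaviour: the hostname is aliased to 127.0.1.1.
default_hosts = render_hosts("zk1")

# With a permanent address (taken from the 18.09 netstat output above),
# the hostname maps to an address other nodes can reach.
static_hosts = render_hosts("zk1", permanent_ipv4="172.20.40.43")

print(default_hosts)
print(static_hosts)
```

Hosts configured through DHCP would keep the default, while statically configured hosts could point the hostname at their real address.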

I guess, the simplest solution is to add something like networking.permanentIPv4 defaulting to 127.0.1.1 and use that in /etc/hosts generator.

Yeah, but having a primary IP address is also kind of weird in a world with multiple NICs per host. On the other hand, aliasing all statically configured addresses also seems like it might cause trouble, as you allude to.

We had to work around it with etcd as well, apparently, so it's not just ZooKeeper.

What's the purpose of the 127.0.1.1 $hostname line? The sources I've found state that the line is added to fix a quirk in GNOME.

(thanks for the reply)

What's the purpose of the 127.0.1.1 $hostname line? The sources I've found state that the line is added to fix a quirk in GNOME.

A lot of stuff needs the hostname aliased to something. For instance, without that line, on a DHCP-configured NixOS host with a DNS server that doesn't automap DHCP-issued hostnames to their addresses (e.g. most commercial home routers), Firefox and Emacs will hang for a while on startup trying to resolve the hostname via DNS, some SMTP clients will hang and then complain about a wrong EHLO line, etc. See #47241, #36261 and all the linked PRs and issues.

A lot of stuff needs the hostname aliased to something. For instance, without that line, on a DHCP-configured NixOS host with a DNS server that doesn't automap DHCP-issued hostnames to their addresses (e.g. most commercial home routers), Firefox and Emacs will hang for a while on startup trying to resolve the hostname via DNS, some SMTP clients will hang and then complain about a wrong EHLO line, etc. See #47241, #36261 and all the linked PRs and issues.

I see. That's quite a bit of possible issues. Too bad this breaks other stuff instead.
For our use case we could just stop setting hostName, since we're also running DHCP. But this still breaks for people who use static network configuration.

Maybe we could introduce a flag to not add the 127.0.1.1 line? Or (I suppose) people could remove the "127.0.1.1" entry from networking.hosts somehow (I'm not sure how to remove elements from an attrSet using the module system).

I have no idea what's the best solution here, but I would be sad to see this go unfixed.

+1 for making that 127.0.1.1 entry configurable.
Personally, I use dnsmasq on my NixOS-based router. Because of this 127.0.1.1 entry, I cannot determine the router's IP address from other devices on the network: dnsmasq replies to requests for its hostname with 127.0.1.1 instead of the real IP address (even though it is configured).

for hdfs it is undesirable to have 127.0.1.1 in /etc/hosts even if IPs are used everywhere in config files instead of hostnames:

Ran into this issue using dnsmasq as a DNS server for a LAN. dnsmasq reads the hosts file and then serves 127.0.1.1 as the address of the host, which breaks any lookup of that host on the LAN.
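
A quick way to spot entries that would be handed out like this is to scan the hosts file for non-localhost names attached to loopback addresses. The following is a minimal Python sketch; `loopback_aliases` is a hypothetical helper, and the sample hosts text (including the names router and nas) is invented for illustration.

```python
import ipaddress

# Names that are expected to point at loopback and are not a problem.
LOCAL_NAMES = {"localhost", "localhost.localdomain"}


def loopback_aliases(hosts_text):
    """Return (address, name) pairs where a non-localhost name maps to loopback."""
    found = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2:
            continue
        try:
            address = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # skip lines that do not start with a valid IP address
        if address.is_loopback:
            found.extend(
                (fields[0], name) for name in fields[1:] if name not in LOCAL_NAMES
            )
    return found


# Invented sample; on a real machine, read /etc/hosts instead.
SAMPLE = """\
127.0.0.1 localhost
::1 localhost
127.0.1.1 router
192.168.1.10 nas
"""

flagged = loopback_aliases(SAMPLE)
print(flagged)
```

Any name this reports is one that a DNS server reading the hosts file could answer with a loopback address.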

What's the purpose of the 127.0.1.1 $hostname line? The sources I've found state that the line is added to fix a quirk in GNOME.

FWIW Pop OS (a gnome desktop) contains a line of the pattern:

127.0.1.1 $hostname.localdomain $hostname

The gnome3 test in NixOS still succeeds with 127.0.1.1 removed, though it doesn't cover enough of GNOME to hit whatever quirk may require this line.

Perhaps related to systemd-resolver shenanigans?

This also breaks certmgr in certain situations (kubernetes).

Thank you for your contributions.

This has been automatically marked as stale because it has had no activity for 180 days.

If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.

Here are suggestions that might help resolve this more quickly:

  1. Search for maintainers and people that previously touched the related code and @ mention them in a comment.
  2. Ask on the NixOS Discourse.
  3. Ask on the #nixos channel on irc.freenode.net.