A working ZooKeeper cluster can't be made on 19.03, and I think this is the culprit: https://github.com/NixOS/nixpkgs/pull/47241
The problem is that the ZK cluster is defined like this:
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
I'm not 100% sure how this works, and it's seriously badly documented (that is, not at all) by the ZK people, but the following seems to fit what I'm seeing:
ZooKeeper knows which id it has, so it uses the hostname from the corresponding line in the configuration to figure out the external address to bind to. Except that since https://github.com/NixOS/nixpkgs/pull/47241 that hostname resolves to 127.0.1.1, so ZooKeeper binds the leader election port (3888) to that address, which makes it impossible for other nodes to connect to it:
$ netstat -nlp
[snip]
tcp 0 0 0.0.0.0:2181 0.0.0.0:* LISTEN -
tcp 0 0 127.0.1.1:3888 0.0.0.0:* LISTEN -
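The suspected mechanism can be sketched in Python. This is not ZooKeeper's actual code: `parse_hosts`, `election_bind_address`, and the sample hosts contents are illustrative assumptions, and 172.20.40.43 simply stands in for whatever address zk1 resolved to before the change. The sketch only models "look up my own server line, resolve its hostname, bind the election port there".

```python
def parse_hosts(text):
    """Map each hostname in hosts-file text to the first address listed for it."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2:
            continue
        address, names = fields[0], fields[1:]
        for name in names:
            table.setdefault(name, address)  # assumption: first match wins
    return table


def election_bind_address(zk_config, my_id, hosts_text):
    """Return the (address, port) that server `my_id` would bind for leader election."""
    hosts = parse_hosts(hosts_text)
    for line in zk_config.splitlines():
        key, _, value = line.strip().partition("=")
        if key == f"server.{my_id}":
            host, _quorum_port, election_port = value.split(":")
            # If the hosts file has no entry, return the name unchanged
            # (a real resolver would ask DNS at this point).
            return hosts.get(host, host), int(election_port)
    raise KeyError(f"no server.{my_id} line in config")


ZK_CONFIG = """\
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
"""

# Hypothetical hosts contents on zk1 after and before the change.
HOSTS_1903 = "127.0.0.1 localhost\n127.0.1.1 zk1\n"
HOSTS_WITH_REAL_IP = "127.0.0.1 localhost\n172.20.40.43 zk1\n"

print(election_bind_address(ZK_CONFIG, 1, HOSTS_1903))          # ('127.0.1.1', 3888)
print(election_bind_address(ZK_CONFIG, 1, HOSTS_WITH_REAL_IP))  # ('172.20.40.43', 3888)
```

With the 127.0.1.1 entry present, the node's own name resolves to a loopback address, which matches the netstat output above.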
The ZK documentation explicitly says that it binds the client port (2181) to all interfaces, which is what it's doing, but it is vague on what happens with the leader election port (3888). Binding that port to a specific address does make sense, though, since the cluster communication can then be locked down to a separate network if need be.
Which other distributions point the hostname to 127.0.1.1? It seems quite unfortunate. It even makes it impossible to test e.g. whether a service is reachable from the outside by connecting to it through its real DNS name (as in curl myhost vs curl localhost).
ping @oxij (sorry, you made the change, so hoping you can give some insight into this)
Create a ZooKeeper cluster, and specify the cluster topology using hostnames instead of IP-addresses.
Please run nix-shell -p nix-info --run "nix-info -m" and paste the results.
Not practical (locked down network), but we've currently pinned to
{
"url": "https://github.com/NixOS/nixpkgs-channels/archive/50876481a0127ad885fcbfd48ab24bbacbc26395.tar.gz",
"rev": "50876481a0127ad885fcbfd48ab24bbacbc26395",
"date": "2019-03-10T23:40:18+01:00",
"sha256": "063q2jhi9lf6azbhlrn3cygpaa3n65n3d8g7c1s0vvsj8rxv8b80"
}
On 18.09 the netstat output is like this instead:
$ netstat -nlp
[snip]
tcp 0 0 172.20.40.43:3888 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:2181 0.0.0.0:* LISTEN
Which works fine.
According to the Debian reference manual (https://www.debian.org/doc/manuals/debian-reference/ch05.en.html#_the_legacy_network_connection_and_configuration):
For a system with a permanent IP address, that permanent IP address should be used here instead of 127.0.1.1.
The implementation in modules/config/networking.nix fails to take this into account as well.
I guess the simplest solution is to add something like networking.permanentIPv4, defaulting to 127.0.1.1, and use that in the /etc/hosts generator.
A more complicated solution is to alias all IP addresses from all the statically configured interfaces to networking.hostName in the /etc/hosts generator, but I suspect that it would have unexpected side effects (and won't really help in your use case).
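To make the two proposals concrete, here is a rough Python model of what the /etc/hosts generator would emit in each case. It is only a sketch: the real generator is a Nix module, and the function names, the `permanent_ipv4` parameter, and the fallback to 127.0.1.1 when no static addresses exist are assumptions made for illustration.

```python
LOOPBACK_ALIAS = "127.0.1.1"


def hosts_with_permanent_ip(host_name, permanent_ipv4=LOOPBACK_ALIAS):
    """Proposal 1: one configurable address for the hostname line."""
    lines = ["127.0.0.1 localhost"]
    if host_name:
        lines.append(f"{permanent_ipv4} {host_name}")
    return "\n".join(lines) + "\n"


def hosts_with_all_static(host_name, static_addresses):
    """Proposal 2: alias every statically configured address to the hostname."""
    lines = ["127.0.0.1 localhost"]
    if host_name:
        # Assumption: with no static addresses, keep the loopback alias.
        for address in static_addresses or [LOOPBACK_ALIAS]:
            lines.append(f"{address} {host_name}")
    return "\n".join(lines) + "\n"


# Default behaviour is unchanged; a host with a fixed address can override it.
print(hosts_with_permanent_ip("zk1"))
print(hosts_with_permanent_ip("zk1", permanent_ipv4="172.20.40.43"))

# Several interfaces yield several lines, so the "right" address is ambiguous.
print(hosts_with_all_static("zk1", ["172.20.40.43", "10.0.0.5"]))
```

The second variant shows where the side effects could come from: a resolver that returns the first match would pick whichever interface happens to be listed first.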
Yeah, but having a primary IP address is also kinda weird in a world with multiple NICs per host. On the other hand, aliasing all statically configured addresses also seems like it might cause trouble, as you allude to.
We had to work around it with etcd as well, apparently, so it's not just ZooKeeper.
What's the purpose of the 127.0.1.1 $hostname line? The sources I've found state that the line is added to fix a quirk in GNOME.
(thanks for the reply)
A lot of stuff needs the hostname aliased to something. For instance, without that line, on a DHCP-configured NixOS host whose DNS server doesn't automatically map DHCP-issued hostnames to their addresses (e.g. most commercial home routers), firefox and emacs will hang for a while on startup trying to resolve the hostname via DNS, some SMTP clients will hang and then complain about a wrong EHLO line, etc. See #47241, #36261 and all the linked PRs and issues.
I see. That's quite a few possible issues. Too bad this breaks other stuff instead.
For our use case we could just stop setting hostName, since we're also running DHCP. But this still breaks for people who use static network configuration.
Maybe we could introduce a flag to not add the 127.0.1.1 line? Or (I suppose) people could remove the "127.0.1.1" entry from networking.hosts somehow (I'm not sure how to remove elements from an attrset using the module system).
I have no idea what's the best solution here, but I would be sad to see this go unfixed.
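Both ideas can be modelled in a few lines of Python. This is a sketch of the logic only, not the NixOS module: `default_hosts`, `render_hosts`, and the `alias_hostname` flag are hypothetical names, and treating an empty name list as "omit this line" is an assumption about how such a generator could let users drop an entry.

```python
def default_hosts(host_name, alias_hostname=True):
    """Build the address -> names mapping, optionally without the 127.0.1.1 alias."""
    hosts = {"127.0.0.1": ["localhost"]}
    if host_name and alias_hostname:
        hosts["127.0.1.1"] = [host_name]
    return hosts


def render_hosts(hosts):
    """Render the mapping as hosts-file text, skipping addresses with no names."""
    return "".join(
        f"{address} {' '.join(names)}\n"
        for address, names in hosts.items()
        if names  # an empty list drops the line entirely
    )


# Idea 1: a flag that suppresses the alias.
print(render_hosts(default_hosts("zk1", alias_hostname=False)))

# Idea 2: keep the default, then override the entry with an empty list
# and point the hostname at a real address instead.
overridden = default_hosts("zk1")
overridden["127.0.1.1"] = []
overridden["172.20.40.43"] = ["zk1"]
print(render_hosts(overridden))
```

Either route leaves the default behaviour untouched for DHCP setups while giving statically configured hosts a way out.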
+1 for making that 127.0.1.1 entry configurable.
Personally I use dnsmasq on my NixOS-based router. Because of this 127.0.1.1 entry I cannot determine the IP address of my router from other devices in the network: dnsmasq just replies with 127.0.1.1 to requests about its hostname instead of the real IP address (even though that address is configured).
For hdfs it is undesirable to have 127.0.1.1 in /etc/hosts even if IPs are used everywhere in config files instead of hostnames:
Ran into this issue using dnsmasq as a DNS server for a LAN. dnsmasq will read from hosts and then provide 127.0.1.1 as the address of the host, which will break any lookup of that host on the LAN.
FWIW Pop OS (a GNOME desktop) contains a line of the pattern:
127.0.1.1 $hostname.localdomain $hostname
The gnome3 test in nixos still succeeds with 127.0.1.1 removed, though that test doesn't cover enough of GNOME to hit whatever quirk may require this line.
Perhaps related to systemd-resolved shenanigans?
This also breaks certmgr in certain situations (kubernetes).