Scylla: nodetool status returns wrong IPv6 addresses

Created on 13 Feb 2020 · 28 Comments · Source: scylladb/scylla

I'm running 3.3 (ami-0b19be243453bd8e8 on eu-west-1), version 3.3.rc1-0.20200209.0d0c1d43188.

Both the AWS console and the local network manager show the same IPv6 address (and it is pingable using ping6):

2a05:d018:223:f00:b111:3a2:aa5f:2295 is the correct IPv6 address of my instance

ip addr  | grep inet6
inet6 ::1/128 scope host 
inet6 2a05:d018:223:f00:b111:3a2:aa5f:2295/128 scope global dynamic 
inet6 fe80::88d:a6ff:fe32:6cce/64 scope link

whereas nodetool status reports a different address (2a05:d018:223:f00:b111:ffa2:aa5f:ff95):

nodetool status
Datacenter: UNKNOWN_DC
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                Load       Tokens       Owns    Host ID                               Rack
UN  2a05:d018:223:f00:1c55:d40e:5a7d:ffb7  461.13 KB  256          ?       8fd87d4c-9fb6-44d9-b016-48de9529f453  UNKNOWN_RACK
UN  2a05:d018:223:f00:b111:ffa2:aa5f:ff95  572.09 KB  256          ?       05a39bc1-3fbb-481a-93c8-e96b1b1f2222  UNKNOWN_RACK
UN  2a05:d018:223:f00:ff85:be21:de55:ffbf  461.97 KB  256          ?       bdac46b6-0069-49ba-8acd-ff6db13a2ddd  UNKNOWN_RACK

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

I captured a pcap using tshark (I will try to attach it here, or will provide a link to some S3 storage).

bug ipv6

fgelcer

All 28 comments

capture.txt

fgelcer on 13 Feb 2020

capture.txt is the pcap captured using tshark on one of the nodes

fgelcer on 13 Feb 2020

I have tested the same scenario using 3.2 and it doesn't show the issue.

fgelcer on 13 Feb 2020

Let's try to reproduce on Monday morning and get the machine up; @elcallio will be able to check this out on the machine.

slivne on 14 Feb 2020

What machine? And check out what? Since server address is not exactly automatically resolved, I would assume that whatever is showing in nodetool is also in scylla.yaml?
ifconfig?
netstat -nlp?
raw api output?

elcallio on 17 Feb 2020

Wrong: scylla.yaml contains the correct IPv6 address, which is why the cluster is up and running.
I can create an environment for you from one of our manager sanity tests, if you like.

fgelcer on 17 Feb 2020

Yes please.

elcallio on 17 Feb 2020

nodetool status

Datacenter: UNKNOWN_DC
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                Load       Tokens       Owns    Host ID                               Rack
UN  2a05:d018:223:f00:ff95:6149:ffb5:3266  433.04 KB  256          ?       a2c865a5-1309-4cc0-a4cb-29c7492a18a6  UNKNOWN_RACK
UN  2a05:d018:223:f00:ff99:921c:2362:ffd6  502.62 KB  256          ?       39d5ffc2-5340-4c0c-9357-9963be6a1d3b  UNKNOWN_RACK
UN  2a05:d018:223:f00:ffaf:ffd9:ffc2:6a0f  547.12 KB  256          ?       eb70f0e1-d880-444d-9bec-121db27d0601  UNKNOWN_RACK

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
This EC2 instance is optimized for Scylla.

[centos@manager-regression-mermaid--db-node-9b3bd7e9-1 ~]$ ip addr | grep inet6
    inet6 ::1/128 scope host 
    inet6 2a05:d018:223:f00:97af:f4d9:eac2:6a0f/128 scope global dynamic 
    inet6 fe80::8c1:15ff:fe82:65f0/64 scope link 
[centos@manager-regression-mermaid--db-node-9b3bd7e9-1 ~]$ #2a05:d018:223:f00:97af:f4d9:eac2:6a0f
[centos@manager-regression-mermaid--db-node-9b3bd7e9-1 ~]$ #2a05:d018:223:f00:ffaf:ffd9:ffc2:6a0f
[centos@manager-regression-mermaid--db-node-9b3bd7e9-1 ~]$ grep 6a0f /etc/scylla/scylla.yaml 
          - seeds: "2a05:d018:223:f00:97af:f4d9:eac2:6a0f"
listen_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
rpc_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
broadcast_rpc_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
prometheus_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f

fgelcer on 17 Feb 2020

@elcallio, the node can be accessed at 34.241.115.182

fgelcer on 17 Feb 2020

I don't see what is wrong here.

~]$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 10.0.164.48  netmask 255.255.0.0  broadcast 10.0.255.255
        inet6 2a05:d018:223:f00:97af:f4d9:eac2:6a0f  prefixlen 128  scopeid 0x0<global>
        inet6 fe80::8c1:15ff:fe82:65f0  prefixlen 64  scopeid 0x20<link>
...
~]$ grep 2a05:d018:223:f00:97af:f4d9:eac2:6a0f /etc/scylla/scylla.yaml 
          - seeds: "2a05:d018:223:f00:97af:f4d9:eac2:6a0f"
listen_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
rpc_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
broadcast_rpc_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
prometheus_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f

~]$ nodetool status
Datacenter: UNKNOWN_DC
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                Load       Tokens       Owns    Host ID                               Rack
UN  2a05:d018:223:f00:ff95:6149:ffb5:3266  507.87 KB  256          ?       a2c865a5-1309-4cc0-a4cb-29c7492a18a6  UNKNOWN_RACK
UN  2a05:d018:223:f00:ff99:921c:2362:ffd6  507.71 KB  256          ?       39d5ffc2-5340-4c0c-9357-9963be6a1d3b  UNKNOWN_RACK
UN  2a05:d018:223:f00:ffaf:ffd9:ffc2:6a0f  622.49 KB  256          ?       eb70f0e1-d880-444d-9bec-121db27d0601  UNKNOWN_RACK

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

The eth0 interface is 2a05:d018:223:f00:97af:f4d9:eac2:6a0f, scylla is bound to the address, and it shows up in the output.

All local info looks consistent.

elcallio on 17 Feb 2020

nope, look closely... they are similar, but the addresses are different:

ffaf:ffd9:ffc2:6a0f
97af:f4d9:eac2:6a0f
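
Because the two addresses differ only in a few hex digits, a group-by-group comparison makes the mismatch easier to spot. The following is a minimal sketch using Python's standard `ipaddress` module; the helper name `differing_groups` is hypothetical and is not part of Scylla or nodetool.

```python
import ipaddress

def differing_groups(expected: str, reported: str):
    """Return (index, expected_group, reported_group) for each 16-bit group that differs."""
    # .exploded yields all eight groups, zero-padded, so positions line up.
    want = ipaddress.IPv6Address(expected).exploded.split(":")
    got = ipaddress.IPv6Address(reported).exploded.split(":")
    return [(i, w, g) for i, (w, g) in enumerate(zip(want, got)) if w != g]

# Address from "ip addr" versus the address printed by nodetool status.
configured = "2a05:d018:223:f00:97af:f4d9:eac2:6a0f"
printed = "2a05:d018:223:f00:ffaf:ffd9:ffc2:6a0f"

for index, want, got in differing_groups(configured, printed):
    print(f"group {index}: expected {want}, reported {got}")
# group 4: expected 97af, reported ffaf
# group 5: expected f4d9, reported ffd9
# group 6: expected eac2, reported ffc2
```

In every differing group the low byte is unchanged and the high byte has become `ff`.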

fgelcer on 17 Feb 2020

Sorry, missed that (confused by the talk of aa5f:2295 above).
Looks like somehow the accumulator mask in print routine is not cleared...?

elcallio on 17 Feb 2020

So, a simple test to parse and re-print the addresses in question yields no errors. There have been no changes in the address formatting code since 2019-12-04, which should be in 3.3, no?
On the other hand, the pattern above does not indicate any consistent "leak" of byte mask/accumulator, so I am not exactly eager to yell "compiler error" either.
But I cannot repro any issue parsing/printing locally. And you say this node/cluster _works_ properly, so we should assume all addresses are ok in memory? (netstat supports this).
Which gcc was this built with? Did we change gcc version somewhere?
I think _maybe_ it might be different gcc doing a wild sign extension/type promotion that does not happen with mine... It would explain this (need high bit in high nibble -> 9)
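
To illustrate the sign-extension hypothesis, here is a minimal Python sketch. It is not Scylla's formatting code; it only assumes that each 16-bit group is assembled from two bytes and that the low byte is treated as a signed 8-bit value before being widened. The names `to_signed_byte` and `format_groups` are hypothetical.

```python
import ipaddress

def to_signed_byte(b: int) -> int:
    """Reinterpret an unsigned byte (0..255) as a signed 8-bit value."""
    return b - 256 if b >= 0x80 else b

def format_groups(packed: bytes, signed_low_byte: bool) -> str:
    """Print 16 address bytes as eight hex groups (no '::' compression)."""
    groups = []
    for i in range(0, 16, 2):
        hi, lo = packed[i], packed[i + 1]
        if signed_low_byte:
            lo = to_signed_byte(lo)
        # A negative low byte, once widened, has all of its upper bits set,
        # so the OR overwrites the high byte with 0xff.
        group = ((hi << 8) | lo) & 0xFFFF
        groups.append(format(group, "x"))
    return ":".join(groups)

packed = ipaddress.IPv6Address("2a05:d018:223:f00:97af:f4d9:eac2:6a0f").packed
print(format_groups(packed, signed_low_byte=False))
# 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
print(format_groups(packed, signed_low_byte=True))
# 2a05:d018:223:f00:ffaf:ffd9:ffc2:6a0f
```

Under this assumption the sketch reproduces the addresses reported in this thread: only groups whose low byte is 0x80 or above are corrupted, which also maps 2a05:d018:223:f00:b111:3a2:aa5f:2295 to 2a05:d018:223:f00:b111:ffa2:aa5f:ff95.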

elcallio on 17 Feb 2020

I've pushed a patch to https://github.com/elcallio/scylla.git. Is it possible for you to build this with the same compiler as the image was built with and test?

elcallio on 17 Feb 2020

In general, the test uses an AMI pre-created with a specific version; in my case, I'm using the latest AMI created for 3.3.
I will check later how I can create an AMI with your patch and give it a try.

fgelcer on 17 Feb 2020

@fgelcer ping

slivne on 10 Mar 2020

@slivne @elcallio, after a lot of fighting I managed to build an AMI with @elcallio's code, but it looks bogus:
/tmp/tmp.TVQD8HscRO/cassandra-env.sh: line 90: [: -ge: unary operator expected

It then fails the whole SCT execution because it was unable to get nodetool status.

fgelcer on 30 Mar 2020

@fgelcer I think you used a scylla-tools-java that was broken. Can you please provide the commit you used for scylla-tools-java?

If you used scylladb/scylla-tools-java@6d76e51472ae3350cbe9b9adc2134eec616d9318 or scylladb/scylla-tools-java@d3e5bbe208e1fd3dea7f4bdf82489c14b2f3e210, then it is clear why it was broken.

There is a fix for this in master: scylladb/scylla-tools-java@2f73d3ae37718ec9f7f1e1cd4bffc96aa4db5d9c

slivne on 2 Apr 2020

Now that scylla-tools-java is fixed, I'm building a new AMI; hopefully it will be ready to try tomorrow.

fgelcer on 6 Apr 2020

@slivne / @elcallio, I can now confirm (after a lot of effort to create and run the AMI with @elcallio's commit) that it works (the test itself failed, but due to unrelated issues).
@elcallio, can you please submit your change as a patch?
Thank you.

fgelcer on 7 Apr 2020

@slivne was this fix backported to 4.0? I ask because I still see the issue with version 4.0.rc3-0.20200501.eee4c00e29.

Gossip info also returns the wrong IP (like nodetool status).

juliayakovlev on 11 May 2020

nope, this commit isn't in branch-4.0

fgelcer on 11 May 2020

Backported to 4.0, 3.3 (4.1 is safe).